Cough Detection Using Acceleration Signals and Deep Learning Techniques

Cough is a frequent symptom in many common respiratory diseases and is considered a predictor of early exacerbation or even disease progression. Continuous cough monitoring offers valuable insights into treatment effectiveness, aiding healthcare providers in timely intervention to prevent exacerbations and hospitalizations. Objective cough monitoring methods have emerged as superior alternatives to subjective methods like questionnaires. In recent years, cough has been monitored using wearable devices equipped with microphones. However, discriminating cough sounds from background noise has been shown to be a particular challenge. This study aimed to demonstrate the effectiveness of single-axis acceleration signals combined with state-of-the-art deep learning (DL) algorithms to distinguish intentional coughing from sounds like speech, laugh, or throat noises. Various DL methods (recurrent, convolutional, and deep convolutional neural networks) combined with one- and two-dimensional time and time–frequency representations, such as the signal envelope, kurtogram, wavelet scalogram, and mel, Bark, and equivalent rectangular bandwidth (ERB) spectrograms, were employed to identify the most effective approach. The optimal strategy, which involved the SqueezeNet model in conjunction with wavelet scalograms, yielded an accuracy and precision of 92.21% and 95.59%, respectively. The proposed method demonstrated its potential for cough monitoring. Future research will focus on validating the system on the spontaneous coughing of subjects with respiratory diseases under natural ambulatory conditions.


Introduction
Cough has been the subject of study in several disciplines of medicine and is considered a common healthcare problem with an associated high socioeconomic burden [1]. Chronic cough affects approximately 10% of the adult population and can severely impair quality of life [2]. Cough is the most common symptom that causes people to seek medical attention [3]. However, cough is considered not only a symptom but also a predictive indicator of acute adverse health events [4]. It is connected with many common respiratory diseases, such as lung cancer, tuberculosis, chronic obstructive pulmonary disease (COPD), asthma, and infections affecting the lower respiratory tract, including coronavirus disease 2019 (COVID-19); and it is associated with exacerbations, impaired lung function, and risk of death [5,6]. In addition, cough has been associated with psychological morbidity [7], and its assessment is also important for evaluating the therapy response [8].
Although quantifying cough frequency is useful as a marker of cough severity and would allow the assessment of cough as a trigger for adverse health events, cough frequency monitoring has not been incorporated into routine clinical practice and continues to be a research instrument [9]. Among the barriers indicated in the recent literature, it has been suggested that cough telemonitoring systems need to be refined and validated for optimal clinical utility [10].
The measurement and monitoring of cough have been conventionally carried out using questionnaires, where the patients self-report the frequency and severity of their cough episodes. This is the case of the Leicester cough questionnaire (LCQ), the cough symptom score (CSS), and the cough-specific quality of life questionnaire (CQLW) [11]. However, the collection of signs and symptoms using questionnaires is subjective since it is biased by patient perception [12]. Available objective methods for assessing cough have been limited to cough sensitivity assessments, which require chemical or mechanical stimulation [13].
Since the automatic detection and characterization of cough episodes has been referred to as a strategy of great interest that can improve disease monitoring [14], trends in respiratory health monitoring have shifted towards more non-invasive and automatic approaches. In the last decade, the audio signal has been used for cough characterization, using 24 h monitoring systems to count episodes (e.g., in studies of drugs to treat cough). One or more body-worn or lapel-type microphones, as well as sensors embedded in mobile phones, have been and continue to be used for this purpose [15,16]. The audio signal recorded by the microphones has been processed by trained algorithms to reduce noise, discard periods of silence, and classify cough events with a high probability of success [17].
The discrimination of cough sounds from other noise (e.g., throat clearing) and speech signals has been shown to be a particular challenge. In addition, despite its long-standing use, commercially available solutions generally still rely on a manual component [14]. To overcome this limitation, sound collection using smartphones and machine learning algorithms has shown promise in effectively measuring cough [18].
As an alternative to microphones, other novel sensors like chest-laminated electronic skin (e-skin) have been proposed for reliable cough detection [19]. In this line, automatic cough detection methods using accelerometers are being explored to enable continuous monitoring of cough frequency and severity. Technologies like accelerometer-based bed occupancy detection systems [20,21] or triaxial accelerometers combined with stretchable strain sensors [22] have shown promising results in accurately detecting cough episodes for long-term monitoring [23]. Additionally, advancements in artificial intelligence have demonstrated promising potential for improving the accuracy and efficiency of cough detection systems [24]. However, the performance of these models in realistic environments remains limited by their significant false positive rate.
This study aimed to show how an accelerometer-based system combined with state-of-the-art deep learning techniques can enable the automatic detection of cough episodes. Acceleration body-worn sensors can more effectively discriminate coughs from external sounds, and the application of artificial intelligence techniques can contribute to the mitigation of false positives. Deep learning approaches were thoroughly explored and compared using the acceleration signals (vibrations) collected using an accelerometer placed on the suprasternal notch.

Participants and Data Acquisition Protocol
To train and validate the artificial intelligence-based models, experimental data were collected from a group of 23 participants (14 men and 9 women). The main objective was the automatic detection of cough, differentiating it from other acoustic events such as speech, laugh, throat clearing, or background noise.
All subjects gave informed consent for inclusion before participating in the study. The study was conducted in accordance with the Declaration of Helsinki, and the protocol was approved by the Ethics Committee of Biomedical Research of Cádiz (project identification code CICERONE, 29.23).
The acceleration signals of each participant were recorded in a single file using a common data structure that facilitated the subsequent separation and labeling of each signal segment. Each recording was carried out with the subject in a seated position and included 10 voluntary coughing episodes, 5 s of speech, 2 s of laughing, 2 s of throat clearing, and two pauses of silence in which the subject's breathing was recorded. Groups of different sounds were separated by 2 s of silence. A user interface was designed to guide the participants during the signal acquisition. Figure 1 shows an example of the accelerometer signal recorded during an experimental session, including cough and non-cough episodes (silence, speech, laugh, and throat clearing).



Hardware
The data acquisition system comprised an accelerometer sensor, with a housing for optimal positioning, and a microprocessor board (Raspberry Pi Zero W). The Raspberry Pi Zero W unit (Broadcom BCM2835 processor with a 1 GHz ARM11 core, 512 MB of RAM, and Linux operating system) was selected as the microprocessor due to its low power consumption, wireless connectivity, and small size. An SD card allowed the storage of the recorded signals. The system was powered by a rechargeable LiPo battery, connected to the microprocessor through a DC/DC converter charger. The assembly was installed in a 3D-printed plastic enclosure designed specifically for this purpose. The acceleration signals were acquired using the low-power-consumption MMA8452Q 3-axis acceleration sensor (Freescale Semiconductor, Austin, TX, USA). The MMA8452Q sensor has a resolution of 12 bits and supports a low-power-consumption mode that extends battery life. It can measure accelerations in the range [−2 g, +2 g], offers I2C communications, and supports a maximum sampling rate of 800 Hz. It is often used for medical monitoring and rehabilitation by providing real-time movement data. Only the z-axis was used in this study to acquire acceleration signals perpendicular to the skin surface at the point of contact. The optimal location for the sensor placement was estimated to be around the thyroid cartilage at the front of the neck, according to a previous study by the authors [25]. The sensor was attached to the participant's skin using an adhesive electrode. The connection between the sensor and the microprocessor unit was made using the I2C communications protocol. Figure 2 illustrates the prototype of the data acquisition system implemented in this study.

Signal Processing and Data Augmentation
The acceleration signals were acquired using a sampling rate of 800 Hz. The reading was carried out by the microprocessor using interrupts. The acquisition and recording process was controlled using the Python 3 programming language. The raw signal was filtered to eliminate unwanted movements of the subject and remove the noise due to the sensor rubbing against the skin. A high-pass FIR filter of order 10 with a cutoff frequency of 50 Hz was used.
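This filtering stage can be sketched as follows (a minimal illustration assuming SciPy; the filter parameters follow the text, but the application details, such as the use of zero-phase filtering, are our own assumptions):

```python
import numpy as np
from scipy.signal import firwin, filtfilt

FS = 800          # sampling rate (Hz)
CUTOFF = 50       # high-pass cutoff (Hz)
ORDER = 10        # FIR filter order -> 11 taps

def highpass(signal: np.ndarray) -> np.ndarray:
    """Attenuate slow body movements and skin-rubbing drift below 50 Hz."""
    taps = firwin(ORDER + 1, CUTOFF, pass_zero=False, fs=FS)
    # Zero-phase filtering avoids shifting cough onsets in time.
    return filtfilt(taps, 1.0, signal)

# Example: a 5 Hz movement artifact plus a 100 Hz vibration component
t = np.arange(FS) / FS
raw = np.sin(2 * np.pi * 5 * t) + 0.5 * np.sin(2 * np.pi * 100 * t)
clean = highpass(raw)
```

Note that such a low-order FIR filter has a wide transition band, so low-frequency artifacts are attenuated rather than fully removed.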
The number of 1 s intervals per hour that contain a cough has been proposed as a cough monitoring index [26]. Accordingly, the cough and non-cough samples were segmented manually into 1 s segments, leaving, if possible, silent intervals before and after. No overlapping was applied. Cough records shorter than 1 s were extended to the full segment length using zero-padding.
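The zero-padding step can be illustrated with a short NumPy helper (our own sketch; the function name and truncation behavior are illustrative):

```python
import numpy as np

FS = 800
SEG_LEN = FS  # 1 s segments at 800 Hz

def to_segment(event: np.ndarray) -> np.ndarray:
    """Zero-pad (or truncate) a manually cut event to exactly 1 s."""
    out = np.zeros(SEG_LEN, dtype=float)
    n = min(len(event), SEG_LEN)
    out[:n] = event[:n]
    return out

# A 0.6 s cough record becomes a 1 s segment padded with zeros
cough = np.random.randn(480)
segment = to_segment(cough)
```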
Random sequential augmentations were applied to increase the number of samples. Pitch shifting, time-scale modification, time shifting, noise addition, and amplitude control were used to increase the number of cough and non-cough samples by factors of 10 and 2, respectively.
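A minimal sketch of such a random sequential augmentation chain is shown below (our own illustration: only time shifting, noise addition, and amplitude control are implemented here with NumPy; pitch shifting and time-scale modification typically require a resampling or phase-vocoder library and are omitted):

```python
import numpy as np

rng = np.random.default_rng(0)

def time_shift(x, max_shift=80):
    """Circularly shift the segment by up to +/-100 ms at 800 Hz."""
    return np.roll(x, rng.integers(-max_shift, max_shift + 1))

def add_noise(x, snr_db=20.0):
    """Add white Gaussian noise at a given signal-to-noise ratio."""
    p_sig = np.mean(x ** 2)
    p_noise = p_sig / (10 ** (snr_db / 10))
    return x + rng.normal(0.0, np.sqrt(p_noise), size=x.shape)

def scale_amplitude(x, low=0.8, high=1.2):
    """Random gain to mimic sensor-coupling variability."""
    return x * rng.uniform(low, high)

def augment(x):
    """Apply one random sequential chain of augmentations."""
    for op in (time_shift, add_noise, scale_amplitude):
        if rng.random() < 0.8:
            x = op(x)
    return x

seg = np.sin(2 * np.pi * 100 * np.arange(800) / 800)
augmented = [augment(seg) for _ in range(10)]  # 10x for cough samples
```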
Table 1 shows the distribution of participants, gender, and samples in the original and augmented sets.

To train the 2D convolutional neural networks (CNNs), the VGGish model, and the deep convolutional neural networks (DCNNs), we used two-dimensional (2D) representations, common in audio signal processing, to capture and describe the relevant features of the acceleration signals. Representations such as mel spectrograms [27], Bark and equivalent rectangular bandwidth (ERB) spectra [28], wavelet scalograms [29], and kurtograms [30] were considered because they provide detailed insight into the temporal and frequency structure of signals.

• Mel, ERB, and Bark spectrograms
Comparing the mel, ERB, and Bark spectrograms applied to the accelerometer signal can provide a deeper understanding of how different frequency scales influence the identification of cough patterns. This can lead to more precise and efficient detection, improving health monitoring applications.
Log-mel spectrograms are representations of the signal spectral energy, where the energy at each frequency is calculated at discrete time intervals. The mel scale is applied to the frequencies before taking the logarithm of the energy. The ERB spectrum also represents the signal spectral energy; in this case, the energy at each frequency is grouped into bands of equivalent rectangular width. The Bark spectrum provides the signal spectral energy using Bark bands.
A 25 ms Hanning window with 10 ms of overlap was used to calculate the mel, Bark, and ERB spectrograms. The FFT length was 512 and the number of bands was set to 50.
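A minimal sketch of a mel-spectrogram computation with these parameters follows (a hand-rolled triangular mel filterbank over SciPy's STFT; the toolbox implementation used in the paper, and the Bark/ERB variants, may differ in their exact band definitions):

```python
import numpy as np
from scipy.signal import stft

FS, NFFT, N_BANDS = 800, 512, 50
WIN = int(0.025 * FS)            # 25 ms Hanning window -> 20 samples
HOP = WIN - int(0.010 * FS)      # 10 ms overlap -> 12-sample hop

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_bands=N_BANDS, fmax=FS / 2):
    """Triangular filters spaced evenly on the mel scale up to 400 Hz."""
    edges = mel_to_hz(np.linspace(0.0, hz_to_mel(fmax), n_bands + 2))
    bins = np.fft.rfftfreq(NFFT, 1 / FS)
    fb = np.zeros((n_bands, len(bins)))
    for b in range(n_bands):
        lo, mid, hi = edges[b], edges[b + 1], edges[b + 2]
        up = (bins - lo) / (mid - lo)
        down = (hi - bins) / (hi - mid)
        fb[b] = np.maximum(0.0, np.minimum(up, down))
    return fb

def log_mel_spectrogram(x):
    _, _, Z = stft(x, fs=FS, window="hann", nperseg=WIN,
                   noverlap=WIN - HOP, nfft=NFFT)
    power = np.abs(Z) ** 2
    return np.log(mel_filterbank() @ power + 1e-10)

x = np.random.randn(800)                # one 1 s acceleration segment
S = log_mel_spectrogram(x)              # shape: (bands, frames)
```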

• Kurtogram, scalogram, and signal envelope
The kurtogram captures the kurtosis of a signal at different time and frequency scales. Kurtosis is a measure of the shape of the signal distribution and can provide useful information about the temporal and spectral characteristics of the signal.
The wavelet scalogram represents the signal as a function of time and frequency, using wavelets to analyze the signal characteristics at different time and frequency scales. The analytic Morse wavelet and 12 voices per octave were used in this study to calculate the continuous wavelet transform.
Finally, the upper and lower envelopes of the acceleration signal were used. The envelopes were determined using spline interpolation over local maxima separated by at least 10 samples.
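The envelope extraction can be sketched with SciPy (our own illustration of spline interpolation over local extrema at least 10 samples apart; the paper's implementation may differ in boundary handling):

```python
import numpy as np
from scipy.interpolate import CubicSpline
from scipy.signal import find_peaks

def envelopes(x, min_dist=10):
    """Upper/lower envelopes via cubic-spline interpolation over local
    extrema separated by at least `min_dist` samples."""
    n = np.arange(len(x))
    hi_idx, _ = find_peaks(x, distance=min_dist)
    lo_idx, _ = find_peaks(-x, distance=min_dist)
    upper = CubicSpline(hi_idx, x[hi_idx])(n)
    lower = CubicSpline(lo_idx, x[lo_idx])(n)
    return upper, lower

t = np.arange(800) / 800
x = np.sin(2 * np.pi * 60 * t) * np.exp(-3 * t)   # decaying 60 Hz burst
upper, lower = envelopes(x)
```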

One-Dimensional Spectrum Feature Extraction
Some of the artificial intelligence techniques used in this study required one-dimensional (1D) signal features as inputs. Mel Frequency Cepstral Coefficients (MFCCs), delta, and delta2 MFCCs were calculated for this purpose. These coefficients capture the signal spectral characteristics by mapping the energy of the frequency spectrum onto a nonlinear mel scale. MFCCs capture information about the spectral shape and content of the signal, making them valuable for classification tasks. Delta MFCCs represent the rate of change of adjacent MFCCs over time, and the delta2 MFCCs are the differentials of the delta MFCC coefficients (acceleration). These parameters provide information useful for distinguishing between signals with different spectral change patterns over time [31]. In total, 13 MFCC, 13 delta MFCC, and 13 delta2 MFCC parameters were calculated.
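The delta and delta2 coefficients can be derived from a precomputed MFCC matrix with the standard regression formula (a sketch; the MFCC matrix below is random stand-in data, and the regression width is an assumption):

```python
import numpy as np

def delta(feat, width=2):
    """d[t] = sum_k k*(c[t+k]-c[t-k]) / (2*sum_k k^2), with edge padding.
    Rows are coefficients, columns are time frames."""
    T = feat.shape[1]
    padded = np.pad(feat, ((0, 0), (width, width)), mode="edge")
    denom = 2 * sum(k * k for k in range(1, width + 1))
    out = np.zeros_like(feat, dtype=float)
    for k in range(1, width + 1):
        out += k * (padded[:, width + k:width + k + T]
                    - padded[:, width - k:width - k + T])
    return out / denom

mfcc = np.random.randn(13, 98)        # 13 MFCCs over 98 frames (example)
d1 = delta(mfcc)                      # delta MFCCs (rate of change)
d2 = delta(d1)                        # delta2 MFCCs (acceleration)
features = np.vstack([mfcc, d1, d2])  # 39 coefficients per frame
```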
In addition, spectral (spectral centroid, spectral roll-off, spectral slope, spectral spread, and harmonic ratio) and time-domain features (zero-crossing rate and short-time energy) were estimated [32]. To calculate the spectral parameters, a Hamming window with 50% overlap was used.
The spectral centroid (C), roll-off, slope (m), and spread (σ) were calculated on the one-sided mel spectrum. The spectral centroid and the spectral spread are statistical spectrum moments and represent the spectral gravity center and the spread of the spectrum around the mean value, respectively. The spectral slope captures the slope of the signal spectrum and reflects the energy distribution between low and high frequencies. The spectral roll-off indicates the frequency f_K below which 95% of the total spectral energy is contained. These parameters were calculated according to Equations (1)-(4):

C = Σ_{i=1..N} f_i S_i / Σ_{i=1..N} S_i    (1)

σ = sqrt( Σ_{i=1..N} (f_i − C)² S_i / Σ_{i=1..N} S_i )    (2)

m = Σ_{i=1..N} (f_i − μ_f)(S_i − μ_S) / Σ_{i=1..N} (f_i − μ_f)²    (3)

Σ_{i=1..k} S_i = 0.95 Σ_{i=1..N} S_i    (4)

where f_i are the mel scale frequencies, S_i is the amplitude of the spectrum at the frequency f_i, μ_f and μ_S are the mean frequency and mean spectral amplitude, N is the number of points in the spectrum, and k is the spectral roll-off point (f_K).
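A minimal NumPy sketch of these spectral descriptors, using the standard definitions (the toy spectrum below is illustrative, not real data):

```python
import numpy as np

def spectral_features(S, f, rolloff_pct=0.95):
    """Centroid, spread, slope, and roll-off of a one-sided spectrum S
    sampled at (mel-scale) frequencies f."""
    centroid = np.sum(f * S) / np.sum(S)
    spread = np.sqrt(np.sum((f - centroid) ** 2 * S) / np.sum(S))
    # Linear-regression slope of S against f
    slope = (np.sum((f - f.mean()) * (S - S.mean()))
             / np.sum((f - f.mean()) ** 2))
    # Frequency below which 95% of the spectral energy is contained
    cum = np.cumsum(S)
    rolloff = f[np.searchsorted(cum, rolloff_pct * cum[-1])]
    return centroid, spread, slope, rolloff

f = np.linspace(0, 400, 50)
S = np.exp(-((f - 120) ** 2) / (2 * 40 ** 2))   # toy spectral peak at 120 Hz
c, sig, m, fk = spectral_features(S, f)
```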
The harmonic ratio (R_H) quantifies the proportion of harmonic (periodic) content to the total energy in the signal and was calculated using the normalized signal autocorrelation according to [33].
The short-time energy was computed in the time domain according to Equation (5):

E = Σ_n x_w²(n)    (5)

where x_w(n) is the windowed signal. The zero-crossing rate was estimated as the rate at which the signal changed from positive to zero to negative or from negative to zero to positive.
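Both time-domain features can be sketched in a few lines (our own illustration; the frame sizes and the 100 Hz test tone are assumptions):

```python
import numpy as np

def short_time_energy(x, win=20, hop=10):
    """E = sum_n x_w(n)^2 per Hamming-windowed frame."""
    w = np.hamming(win)
    n_frames = 1 + (len(x) - win) // hop
    return np.array([np.sum((x[i * hop:i * hop + win] * w) ** 2)
                     for i in range(n_frames)])

def zero_crossing_rate(x):
    """Fraction of consecutive samples where the sign changes."""
    signs = np.sign(x)
    return np.mean(np.abs(np.diff(signs)) > 0)

x = np.sin(2 * np.pi * 100 * np.arange(800) / 800)  # 100 Hz tone, 1 s
ste = short_time_energy(x)
zcr = zero_crossing_rate(x)
```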
To assess the effect of the window size, the performance of the models for window sizes ranging from 25 ms to 400 ms was explored. In total, 46 1D features were calculated for each sample. Features were standardized.

Deep Learning Models
In the last decade, there has been a growing interest in applying artificial intelligence techniques to respiratory sound classification tasks such as cough detection and respiratory disorder identification [34][35][36][37]. In general, deep learning models have demonstrated great potential to achieve an accuracy and performance higher than those of traditional machine learning models.
In this study, four categories of deep learning network models were used as binary classifiers capable of detecting cough events: recurrent neural networks (RNNs), convolutional neural networks (CNNs), the VGGish network, and deep convolutional neural networks (DCNNs). Figure 3 shows the time-frequency features used for each DL family.

Recurrent Neural Networks (RNNs)
RNNs, despite their sequential nature, can be deep if they have multiple layers. Deep RNNs have multiple recurrent layers stacked on top of each other. RNNs process input data step-by-step and have loops within their structure, allowing information to persist over time. This makes these models suitable for tasks such as time series prediction, natural language processing, and voice disorder detection [38].
The Long Short-Term Memory (LSTM) model was used in this study. The LSTM architecture has been used in the context of cough sounds, tuberculosis, and COVID-19 with promising results [39,40]. An LSTM model is a type of RNN designed to handle the vanishing gradient problem. It is capable of learning long-term dependencies in sequential data, making it suitable for processing sequences of cough sounds over time.
The LSTM model used in this study was designed with an LSTM layer and two fully connected layers (Figure 4). The input layer is fed with the 46 spectral and time-domain features extracted from each 1 s segment of the acceleration signal, as described in Section 2.4.2. The second fully connected layer feeds into a SoftMax classifier with two class labels.
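The described topology can be sketched in PyTorch (a sketch only: the hidden-layer sizes are our assumptions, and the paper's MATLAB implementation details are not reproduced):

```python
import torch
import torch.nn as nn

class CoughLSTM(nn.Module):
    """One LSTM layer followed by two fully connected layers and a
    softmax output, mirroring the architecture described in the text
    (hidden sizes are illustrative guesses)."""
    def __init__(self, n_features=46, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True)
        self.fc1 = nn.Linear(hidden, 32)
        self.fc2 = nn.Linear(32, 2)   # cough / non-cough

    def forward(self, x):             # x: (batch, time, 46)
        _, (h, _) = self.lstm(x)
        z = torch.relu(self.fc1(h[-1]))
        return torch.softmax(self.fc2(z), dim=1)

model = CoughLSTM()
probs = model(torch.randn(8, 1, 46))  # 8 segments, one feature vector each
```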


Convolutional Neural Networks (CNNs)
A CNN is a type of feed-forward neural network with convolutional layers followed by pooling layers to learn local features in the input data. CNNs have a relatively straightforward architecture and are primarily used for image recognition and classification tasks. However, they have been adapted for processing other data types, including cough sounds [40,41].
Generally, a CNN model contains two main parts: convolutional layers, which perform feature extraction, and fully connected layers, which use the extracted features to perform the classification.
In this study, the CNN model detailed in Figure 5 was used for the cough classification task. This architecture has been proposed recently for heart sound classification tasks [42]. It includes five convolutional layers and one fully connected layer. The first three convolutional layers are followed by overlapping max-pooling layers. The fifth convolutional layer is linked to the fully connected layer, which feeds into a SoftMax classifier with two class labels. ERB, mel, and Bark spectrograms were used as input images (98 × 50) for this CNN model.
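A PyTorch sketch of this layout follows (five convolutional layers, overlapping max pooling after the first three, and one fully connected layer; channel counts and kernel sizes are our assumptions, not the values in Figure 5):

```python
import torch
import torch.nn as nn

class CoughCNN(nn.Module):
    """Five conv layers + one fully connected layer, with overlapping
    max pooling (3x3, stride 2) after the first three conv layers."""
    def __init__(self):
        super().__init__()
        def block(cin, cout, pool):
            layers = [nn.Conv2d(cin, cout, 3, padding=1), nn.ReLU()]
            if pool:
                layers.append(nn.MaxPool2d(3, stride=2))  # overlapping pool
            return layers
        self.features = nn.Sequential(
            *block(1, 16, True), *block(16, 32, True), *block(32, 64, True),
            *block(64, 64, False), *block(64, 64, False),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.classifier = nn.Linear(64, 2)

    def forward(self, x):              # x: (batch, 1, 98, 50) spectrograms
        return torch.softmax(self.classifier(self.features(x)), dim=1)

model = CoughCNN()
probs = model(torch.randn(4, 1, 98, 50))
```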

Mixed RNN-CNN Approach: VGGish
VGGish is a deep learning model that utilizes both convolutional and recurrent neural network architectures. It was developed by Google for audio feature extraction and is based on the VGG (Visual Geometry Group) architecture. The VGGish model operates on log-mel spectrograms and uses 2D convolutional, pooling, and fully connected layers to extract relevant features from signal spectrograms (Figure 6). VGGish has been used very recently as a feature extractor for COVID-19 cough classification [43]. The pre-trained VGGish model was used and fine-tuned by letting it iterate on our dataset.

Before feeding the VGGish model, the mechano-acoustic signals were resampled to 16,000 Hz, and an array of mel spectrograms was computed to produce a fixed-length feature vector (96 × 64) for each signal segment. The window size was 25 ms and the overlap percentage between consecutive mel spectrograms was 50%. These feature vectors were used as inputs to tune the pre-trained VGGish classifier.

Deep Convolutional Neural Networks (DCNNs)
DCNNs refer specifically to CNN architectures with a significant number of layers, which are capable of learning highly abstract and complex features from raw data. DCNNs typically consist of many convolutional layers, sometimes interspersed with pooling layers and followed by fully connected layers. In this study, we trained and validated four DCNN networks, namely ResNet-50, SqueezeNet, Inception-ResNet-V2, and MobileNet-V2.
ResNet-50 is a DCNN that uses residual connections for extracting complex features from input data. SqueezeNet was designed to be lightweight and computationally efficient, using 1 × 1 convolutions to reduce the number of parameters and enabling efficient implementation on resource-constrained devices. The Inception-ResNet-V2 architecture combines the idea of Inception modules with residual connections, allowing for more efficient learning of complex features at different scales. Finally, the MobileNet-V2 architecture utilizes depth-wise separable convolutions to achieve a balance between accuracy and efficiency. These networks have been very recently used for tuberculosis and COVID-19 detection using cough sounds [40,44,45]. We took the pre-trained networks and used them as a starting point to learn the task of cough detection via transfer learning.

Validation
Leave-One-Subject-Out Cross-Validation (LOSOCV) was applied to evaluate the performance of the deep learning models. LOSOCV is commonly used in contexts involving subject-specific data and is based on dividing the dataset by subjects, ensuring that all data from one subject are used as the test set, while data from the remaining subjects are used as the training set. This process was repeated for each subject in the dataset, ensuring that each subject's data were used once for testing. The average model performance across all iterations was then calculated to obtain the overall metrics.
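This splitting scheme maps directly onto scikit-learn's `LeaveOneGroupOut` (a sketch with random stand-in data; the segment counts per subject are illustrative):

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut

# Toy stand-ins: 5 segments from each of the 23 subjects.
X = np.random.randn(23 * 5, 46)
y = np.random.randint(0, 2, size=23 * 5)
groups = np.repeat(np.arange(23), 5)       # subject ID for every segment

logo = LeaveOneGroupOut()
for train_idx, test_idx in logo.split(X, y, groups):
    # All segments of exactly one subject end up in the test fold.
    assert len(np.intersect1d(groups[train_idx], groups[test_idx])) == 0
    # model.fit(X[train_idx], y[train_idx]) would be called here.

n_folds = logo.get_n_splits(groups=groups)  # one fold per subject
```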
All deep learning models were trained and validated using MATLAB R2023a on a 12th Gen Intel(R) Core(TM) i7-12700K 3.60 GHz machine equipped with an NVIDIA GeForce RTX 3070 Ti GPU. Some tests were carried out to evaluate the hyperparameters that led to stable training processes and high performance metrics. The learning rate and the batch size were set to 1 × 10−4 and 64, respectively. For each training run, the model version with the best validation loss was stored and evaluated.

Performance Metrics
To ensure consistency and generalization of each resulting model, metrics widely used to measure the performance of machine learning classifiers were calculated. These metrics included accuracy (Acc), precision (Pr), sensitivity (Se), specificity (Sp), false positive rate (FPR), and Cohen's Kappa index. Acc, Pr, Se, Sp, and FPR were calculated using the confusion matrix (Equations (6)-(10)):

Acc = (TP + TN) / (TP + TN + FP + FN)    (6)

Pr = TP / (TP + FP)    (7)

Se = TP / (TP + FN)    (8)

Sp = TN / (TN + FP)    (9)

FPR = FP / (FP + TN)    (10)

Cohen's Kappa index was estimated using Equations (11) and (12):

κ = (p_o − p_e) / (1 − p_e)    (11)

p_e = [(TP + FN)(TP + FP) + (TN + FP)(TN + FN)] / (TP + TN + FP + FN)²    (12)

where p_o is the observed agreement (Acc) and p_e is the agreement expected by chance. Table 2 shows the confusion matrix for a binary classification problem.
Ideally, the FPR should be low to prevent a significant number of non-cough episodes from being detected as coughs.The closer the rest of the metrics are to 100%, or 1 in the case of the Kappa index, the better the model performance.
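These metrics follow directly from the four confusion-matrix entries; a short sketch using the standard definitions (the example counts are illustrative, not results from the paper):

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, sensitivity, specificity, FPR, and Cohen's
    kappa from a binary confusion matrix."""
    total = tp + tn + fp + fn
    acc = (tp + tn) / total
    pr = tp / (tp + fp)
    se = tp / (tp + fn)
    sp = tn / (tn + fp)
    fpr = fp / (fp + tn)
    # Agreement expected by chance, from the marginal totals
    pe = ((tp + fn) * (tp + fp) + (tn + fp) * (tn + fn)) / total ** 2
    kappa = (acc - pe) / (1 - pe)
    return dict(acc=acc, pr=pr, se=se, sp=sp, fpr=fpr, kappa=kappa)

m = classification_metrics(tp=90, tn=85, fp=10, fn=15)
```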
All metrics were determined with 95% confidence intervals (CIs). The bootstrap approach was used to estimate the Kappa CI [46]. For the rest of the metrics, the Clopper–Pearson method was applied [47].

Recurrent Neural Networks (RNN)
Table 3 summarizes the results obtained using the LSTM model. The accuracy varied with the window size, with the best result (Acc = 89.53%) obtained for a window size of 25 ms.

Convolutional Neural Networks (CNN)
Figure 7 shows some examples of spectrograms of cough and non-cough signals. Table 4 summarizes the results obtained using the CNN model when using the mel, ERB, and Bark spectrograms. Different window sizes were evaluated, and the best results were achieved for a window size of 25 ms. The accuracy and precision in the best case were 89.13% and 91.35%, respectively (ERB spectrogram).


Mixed RNN-CNN approach: VGGish
A VGGish pre-trained network was fine-tuned using transfer learning and fed with the log-mel spectrograms computed from each acceleration signal segment. Table 5 summarizes the validation results. An accuracy of 90.35% was achieved.

Deep Convolutional Neural Networks (DCNNs)
Figure 8 shows the mel spectrogram, the kurtogram, the scalogram, and the signal envelope of some cough and non-cough signals. Tables 6-9 summarize the results obtained using all the proposed DCNN models (ResNet-50, SqueezeNet, Inception-ResNet-v2, MobileNet-V2) and the proposed 2D features. In the case of the mel spectrogram, the best result was achieved with the SqueezeNet model (accuracy of 90.02%). For the kurtogram, the best result was obtained using the Inception-ResNet-v2 architecture, which led to an accuracy of 79.99%. The best performance using the scalogram was found with the SqueezeNet model, which reached an accuracy of 92.21%. Finally, the signal envelope provided its best result with the SqueezeNet model, with an accuracy of 89.67%. Table 10 summarizes the best results obtained among all the DL models evaluated. As shown, mel and ERB spectrograms, as well as wavelet scalograms, combined with CNN, DCNN, and mixed CNN + RNN models performed well in the classification of cough events, discriminating them with a high degree of efficiency from other sounds such as speech, laughter, or voice-clearing noises.
In particular, the SqueezeNet model with wavelet scalograms (accuracy of 92.21%) and the VGGish architecture with mel spectrograms (accuracy of 90.35%) outperformed the other models. The performance of the CNN and RNN models was also noteworthy, with accuracies greater than 89%.

Discussion
In this work, we presented an approach for the automatic classification of cough events using a single-sensor, single-signal system. Deep learning techniques were applied to the signal recorded by a digital accelerometer located in the suprasternal notch. One- and two-dimensional time-frequency features were thoroughly explored.
Although cough assessment has conventionally been carried out using microphones [48], the discrimination of cough and non-cough activities remains a challenge [49]. For this purpose, the use of accelerometers has been proposed as a robust alternative. The use of accelerometers for the quantification of biomedical signals such as heart rate and its variability, respiratory rate, or snoring is widespread in the scientific literature and presents certain benefits over conventional microphones [50]. The sampling rate is significantly lower, which substantially reduces the computational burden in most DL models. This enables the use of complex deep learning models that can be deployed on portable devices. In addition, sensitivity to external background noise is significantly reduced, given the coupling of the sensor to the patient's body. Finally, privacy is enhanced, since continuous monitoring does not involve recording audio signals.
However, unlike the processing of sounds recorded by microphones to detect coughing, the application of current deep learning techniques to signals recorded by acceleration sensors has been little explored. In [51], an accelerometer system was used to evaluate a self-contained, ambulatory cough monitor using conventional signal processing techniques. More recently, the use of deep learning techniques together with accelerometry signals from a smartphone placed at the patient's bedside has been assessed [52]. Bed-mounted sensors represent a good option for monitoring nocturnal cough but are not valid for assessing cough in ambulatory conditions. In [49], a dual accelerometer was placed on the patient's neck, and 1D temporal, time-frequency, frequency, and information-theoretic features were used with artificial neural networks and support vector machines, achieving an accuracy of 90.2% in detecting voluntary cough.
In our study, a single-axis signal from an accelerometer placed on the patient's suprasternal notch was used to screen cough. DL models of four architectures were evaluated, and the performance of 1D and 2D time-frequency representations used as input to these models was thoroughly assessed. The models explored offer a variety of approaches to cough signal classification, from recurrent networks capturing temporal sequences to deep and lightweight convolutional networks extracting relevant features from signal spectrograms.
Overall, all the DL architectures and 1D/2D representations evaluated showed high performance. The DCNN models using the kurtogram as input showed the lowest efficiency. The kurtogram is a useful tool for detecting signal transients, but several aspects could limit its effectiveness in the specific application of cough detection using acceleration signals.
The kurtogram offers a particular resolution that may not be best suited to all types of transients present in a cough; it appears that this resolution does not adequately capture the distinctive features of the cough. In contrast, the 1D spectral and temporal features showed discriminative ability with the LSTM model (RNN). The same was observed for the ERB, mel, and Bark spectrograms with CNN networks, which allow a better representation of transients and frequency components relevant to the coughing process. These 1D and 2D representations can provide a detailed picture in both the time and frequency domains, which is crucial for capturing the transient characteristics of a cough.
The signal envelope captures the slow variation of the signal amplitude, eliminating the high-frequency components. This is particularly useful for identifying high-energy events such as coughing, but it does not provide the same detailed frequency-domain information as spectrograms, which may hinder its accuracy. In contrast, both the scalograms and the coefficients derived from mel spectrograms were found to be particularly effective. Mel spectrograms are widely used in pattern recognition and deep learning due to their effectiveness in extracting relevant features from acoustic signals. These scales apply nonlinear transformations that may be more effective in capturing the discriminative characteristics of complex signals such as the acceleration signals linked to coughing. Scalograms use wavelet transforms; unlike other transforms, wavelets allow good resolution in both domains, efficiently capturing both transients and stationary components, which may underlie the success of the models that used these 2D features.
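Both representations discussed in this paragraph can be sketched in a few lines: an amplitude envelope via the Hilbert transform, and a toy Morlet-wavelet scalogram built by direct convolution. The wavelet parameters and sampling rate are illustrative assumptions, not the study's settings:

```python
import numpy as np
from scipy.signal import hilbert

def envelope(x):
    """Amplitude envelope via the analytic signal (Hilbert transform)."""
    return np.abs(hilbert(x))

def morlet_scalogram(x, fs, freqs, w=6.0):
    """Minimal scalogram: convolve the signal with complex Morlet wavelets
    at the requested centre frequencies (a sketch of the idea only)."""
    scalogram = np.empty((len(freqs), len(x)))
    t = np.arange(-0.1, 0.1, 1.0 / fs)               # 200 ms wavelet support
    for i, f in enumerate(freqs):
        s = w / (2 * np.pi * f)                      # Gaussian width per frequency
        wavelet = np.exp(2j * np.pi * f * t) * np.exp(-t**2 / (2 * s**2))
        wavelet /= np.sqrt(np.abs(wavelet).sum())    # rough normalization
        scalogram[i] = np.abs(np.convolve(x, wavelet, mode="same"))
    return scalogram

fs = 2000                                            # illustrative sampling rate
sig = np.sin(2 * np.pi * 100 * np.arange(fs) / fs)   # 1 s, 100 Hz test tone
env = envelope(sig)
S = morlet_scalogram(sig, fs, freqs=np.array([50.0, 100.0, 200.0]))
print(env.shape, S.shape)
```

For the pure 100 Hz tone, the scalogram row centred at 100 Hz carries most of the energy, which is the frequency-localization property the text attributes to wavelets.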
The selection of the optimal model depends on factors such as the required accuracy, the available computational resources, and the deployment context. The SqueezeNet model with wavelet scalograms and the VGGish model with mel spectrograms performed consistently well, with high accuracy (92.21% and 90.35%, respectively) and specificity (94.70% and 94.13%, respectively). Both models produced a low false positive rate (below 6%), indicating high robustness against the main barrier in cough detection: false positives due to background noise or other similar sounds such as voice, laughing, or throat clearing.
The performance of the SqueezeNet model was outstanding. It is a compact architecture designed to be efficient in terms of model size and computational requirements while maintaining high performance in classification tasks. It can perform inference faster, which is crucial for real-time applications on devices with limited computational resources (e.g., smartphones, embedded systems, and wearables).
Finally, this study presents limitations. Firstly, a larger sample size is required to validate the results. Secondly, voluntary coughs present different patterns from spontaneous coughs [53]. However, the ability to detect voluntary coughs under acoustically challenging ambient conditions is a first step toward a clinically applicable ambulatory cough monitoring system [54]. Consequently, since the cough episodes used in this study were collected from healthy subjects and were simulated, future work must focus on validating the system on spontaneous coughing of subjects with respiratory diseases under natural ambulatory conditions. In addition, the next steps include the discrimination of wet and dry cough and the study of gender differences in cough patterns.

Conclusions
In this study, we evaluated a cough detection system using a digital accelerometer to capture single-axis acceleration signals. The system's ability to distinguish voluntary coughs from background noises was extensively tested using DL techniques and time-frequency representations. The most effective approaches involved wavelet scalograms and mel spectrograms and were achieved with the SqueezeNet and VGGish models, respectively. The results obtained show promising potential for cough monitoring.

Figure 1. An example of the accelerometer signal recorded in one of the experimental sessions. The different types of signals recorded over time in each experiment (cough, speech, laugh, and voice-clearing noise) are labeled.

Figure 2. Detail of the prototype for signal acquisition.

Figure 3. Deep learning models, features, and time-frequency representations used in the study.

Figure 5. Architecture of the CNN model.

Figure 6. The architecture of the VGGish model.

Figure 7. Examples of ERB, mel, and Bark spectrograms for cough and non-cough signals.

Table 4. Performance achieved by the CNN classifier using the proposed 2D feature extraction techniques (mel, ERB, and Bark spectrograms). Acc: accuracy; Pr: precision; Se: sensitivity; Sp: specificity; FPR: false positive rate. All metrics are expressed as percentages (%), and 95% confidence intervals (95% CI) are provided. The best accuracy is in bold.

Figure 8. Examples of the mel spectrogram, kurtogram, wavelet scalogram, and signal envelope for cough and non-cough signals.

Table 1. Distribution of participants, recordings, and samples in the original and augmented sets.

Table 2. Confusion matrix. True positive (TP) and true negative (TN) represent the number of samples in the positive and negative class that are correctly classified, respectively. False negative (FN) and false positive (FP) represent the samples that are incorrectly classified in the positive and negative class, respectively. From these counts, the chance-expected agreement used in Cohen's Kappa is p_e = [(TP + FN)(TP + FP) + (FP + TN)(FN + TN)] / N^2, where N is the total number of samples.

Table 6. Performance achieved by the DCNNs with mel spectrograms as input. Acc: accuracy; Pr: precision; Se: sensitivity; Sp: specificity; FPR: false positive rate. All metrics are expressed as percentages (%) with 95% confidence intervals (95% CI). The best accuracy is in bold.

Table 7. Performance achieved by the DCNNs with the kurtogram as input. Acc: accuracy; Pr: precision; Se: sensitivity; Sp: specificity; FPR: false positive rate. All metrics are expressed as percentages (%) with 95% confidence intervals (95% CI). The best accuracy is in bold.

Table 8. Performance achieved by the DCNNs with the wavelet scalogram as input. Acc: accuracy; Pr: precision; Se: sensitivity; Sp: specificity; FPR: false positive rate. All metrics are expressed as percentages (%) with 95% confidence intervals (95% CI). The best accuracy is in bold.

Table 9. Performance achieved by the DCNNs with the signal envelope as input. Acc: accuracy; Pr: precision; Se: sensitivity; Sp: specificity; FPR: false positive rate. All metrics are expressed as percentages (%) with 95% confidence intervals (95% CI). The best accuracy is in bold.

Table 10. Summary of the best results. Acc: accuracy; Pr: precision; Sp: specificity; FPR: false positive rate. All metrics are expressed as percentages (%) with 95% confidence intervals (95% CI). The best accuracy is in bold.