Detection of COVID-19 from Deep Breathing Sounds Using Sound Spectrum with Image Augmentation and Deep Learning Techniques

Abstract: The COVID-19 pandemic is one of the most disruptive outbreaks of the 21st century considering its impacts on our freedoms and social lifestyle. Several methods have been used to monitor and diagnose this virus, including the RT-PCR test and chest CT/CXR scans. Recent studies have employed various crowdsourced sound data types such as coughing, breathing, and sneezing for the detection of COVID-19. However, the application of artificial intelligence methods and machine learning algorithms to these sound datasets still suffers from limitations such as poor test performance due to a high number of misclassified samples, limited datasets resulting in the overfitting of deep learning methods, the high computational cost of some augmentation models, and feature-extracted images of varying quality resulting in poor reliability. We propose a simple yet effective deep learning model, called DeepShufNet, for COVID-19 detection. A data augmentation method based on color transformation and noise addition was used to generate synthetic image datasets from sound data. The effectiveness of the synthetic datasets was evaluated using two feature extraction approaches, namely Mel spectrogram and GFCC. The performance of the proposed DeepShufNet model was evaluated using the deep breathing COSWARA dataset, showing improved performance with a lower misclassification rate of the minority class. The proposed model achieved an accuracy, precision, recall, specificity, and F-score of 90.1%, 77.1%, 62.7%, 95.98%, and 69.1%, respectively, for positive COVID-19 detection using the Mel COCOA-2 augmented training datasets. The proposed model showed improved performance compared to some state-of-the-art methods.


Introduction
The coronavirus disease (COVID-19) pandemic is a respiratory infection caused primarily by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) and has infected more than 44 million individuals globally [1]. This 21st-century pandemic has negatively affected global economic activities such as finance [2], security [3], food security, education, and global peace [4], with some positive results in reducing urban pollution [5]. The spread of this virus, from the alpha to the beta variant, has affected both the health and the welfare of citizens around the world [6]. The World Health Organization (WHO) declared the novel coronavirus disease a Public Health Emergency of International Concern (PHEIC) on 30 January 2020 due to its easy spread, high transmission rate, and communicability [7].
Previous studies have shown that some of the clinical signs of patients infected with COVID-19 are closely related to those of other viral upper respiratory diseases such as respiratory syncytial virus (RSV), influenza, and bacterial pneumonia, while other common symptoms are sore throat, pleurisy, shortness of breath, dry cough, fever, and headache [8]. Different tools and methods have been used for monitoring and diagnosing this virus, such as Real-Time Polymerase Chain Reaction (RT-PCR) [9], medical imaging such as computed tomography (CT) scan images [10,11], chest X-ray [12,13], and lung ultrasonography [14], as well as blood samples [15], urine [16], and feces [17]. However, the limitations of these methods include inaccurate results, cost implications, the varying quality and reliability of available SARS-CoV-2 nucleic acid detection kits, and the insufficient number and throughput of laboratories performing the RT-PCR test [18]. Similarly, the use of medical images for diagnosis has its share of limitations, such as the cost of setup and the insufficient number of machines in hospitals for conducting timely COVID-19 screening [19]. These medical images are processed using various machine learning, deep learning [20], and other artificial intelligence methods [21], making them more effective.
Recently, the use of respiratory sounds or human audio samples such as coughing, breathing, counting, and vowel sounds for the detection of COVID-19 [22–24] has been presented as an alternative, simple, and inexpensive method for monitoring the disease. Sound or audio classification tasks have continued to increase thanks to their wide span of applications in our everyday lives, including medical diagnostics for cognitive decline [25] and laryngeal cancer [26]. Sound or audio recognition involves recognizing the audio stream related to various environmental sounds. Deep convolutional neural networks (CNNs) have shown very impressive performance in sound classification. This is based on the strong capability of deep CNN architectures to identify the key features that map audio spectrograms to sound labels, such as the time and frequency energy modulation patterns over spectrogram inputs [27]. The need for deep CNN models in sound classification stems from challenges posed by conventional machine learning methods, which include the inability to effectively identify features in the spectro-temporal patterns of different sounds [28]. The recent adoption of deep networks is mainly due to their stronger representational ability, which yields better classification performance [29]. However, real-life applications of deep neural networks suffer from overfitting, which typically results from limited datasets (data starvation), class imbalance, and the difficulty of proper annotation in many practical scenarios due to the cost and time it requires [30]. In addition to these challenges, traditional audio feature techniques such as Mel Frequency Cepstral Coefficients (MFCC) have shortcomings in identifying the important features within different sounds for efficient classification. Therefore, alternative methods such as cochleagrams are sought for audio feature extraction [31].
CNNs are effective at learning from images. Deep CNNs are particularly well suited to the problem of sound classification for two reasons: first, when used with spectrogram-like image inputs, they can capture energy modulation patterns across time and frequency, which has been shown to be a key characteristic for differentiating between different sounds [32]; second, they can learn discriminative spectro-temporal patterns [27]. The human body is complex, which makes effective classification difficult because the underlying patterns in the data are hard to spot. The introduction of image-based sound classification allows for the efficient recording of a variety of sound patterns, including those coming from the heart and lungs [33,34]. However, in many situations, data augmentation is required to achieve generalization [35].
Data augmentation has consistently shown its relevance in improving generalization by applying one or more deformations to a set of labelled training samples, thus generating additional training data. Some of the most effective data augmentation methods proposed in existing studies for audio/sound datasets include semantics-preserving deformations in music datasets, random time-shifting [36], pitch shifting, and time stretching. Some traditional data augmentation techniques have proven insufficient on other sound datasets, with very high time complexity for training and an insignificant impact on the performance of some state-of-the-art models [27]. Wang et al. [30] applied GAN-based semi-supervised learning using a low-density sample annealing scheme for generating new fake audio spectrograms with labelled IFER data. Other studies adopted image augmentation techniques to increase the number of spectrogram images. Mushtaq et al. [37] applied some of the most widely used image augmentation techniques to audio files converted to spectrogram images. The authors also applied five of the most popular deformation approaches to the audio files, including pitch shift, time stretch, and trim silence. The study concluded that their proposed data augmentation method improved the performance of the DCNN model more than the traditional image augmentation methods, with increasing accuracy on the training, validation, and test datasets. Based on the findings of recent studies, the combination of appropriate feature extraction methods with deep learning models and suitable data augmentation techniques can aid the performance of classifiers in sound classification. Therefore, this paper introduces effective and improved data augmentation schemes for deep learning models for sound record classification in COVID-19 detection.
In summary, the main contributions of our study are as follows. First, we applied simple and effective data augmentation schemes for efficient data generalization in COVID-19 detection. Second, a pre-trained CNN architecture called DeepShufNet was analyzed and evaluated. The experimental analysis of the augmented datasets in comparison with baseline results showed significant improvement in performance metrics, better data generalization, and improved test results. In addition, we compared and investigated the impact of data augmentation on two feature extraction methods (Mel spectrograms and GFCC) for the detection of COVID-19 symptomatic cases, positive asymptomatic cases, and fully recovered cases. The results showed near-optimal performance, especially in recall, precision, and F1-score. The remaining part of this paper is organized as follows.
The related work is presented in Section 2, where we discuss in detail all significant approaches used for data augmentation, and learning classifiers concerning audio/sound classification. In Section 3, an introduction to our proposed methodology is fully discussed with emphasis on the dataset used, as well as our proposed data augmentation and deep learning methods. Detailed results from and discussions on the comparison of the proposed method with others' published results are presented in Section 4. In Section 5, conclusive remarks are given.

Related Work
This section discusses in detail some of the state-of-the-art data augmentation techniques and classification models used by previous researchers in sound/audio classification. Research trends in COVID-19 detection include the use of conventional machine learning algorithms on sound datasets, which include but are not limited to coughing, deep breathing, and sneezing. Machine learning algorithms have been applied to the detection of COVID-19 with improved results, such as in a study by Sharma et al. [22], who analyzed audio texture for COVID-19 detection using datasets with different sound samples and a weighted KNN classifier. Tena et al. [38] conducted COVID-19 detection using five classifiers, namely Random Forest (RF), SVM, LDA, LR, and Naïve Bayes. The RF classifier outperformed the other machine learning methods with a significant improvement in accuracy on five datasets; its shortfall, however, is lower specificity. Chowdhury et al. [39] presented an ensemble method using the multi-criteria decision making (MCDM) method, and the best performance was obtained with an extra tree classifier.
The authors of [40] applied Gaussian noise augmentation techniques and AUCORes-Net for the detection of COVID-19. Loey and Mirjalili [41] compared six deep learning architectures, namely GoogleNet, ResNet-18, ResNet-50, ResNet-101, MobileNet, and NasNetmobile, for the detection of COVID-19 using the Coughdataset. The study shows that ResNet-18 outperforms the other models by a significant margin. Pahar et al. [42] presented three pre-trained deep neural networks, a CNN, an LSTM, and a ResNet50 architecture, for the detection of COVID-19 using five datasets. Erdogan and Narin [43] fed deep features from the ResNet50 and MobileNet architectures to a support vector machine for the detection of COVID-19; feature extraction used two conventional approaches, empirical mode decomposition (EMD) and discrete wavelet transform (DWT). The study shows high performance with ResNet50 deep features. Sait et al. [44] proposed a transfer learning model called CovScanNet for the classification of COVID-19 using multimodal datasets. Soltanian and Borna [45] investigated the impact of a lightweight deep learning model on the classification of COVID-19 versus non-COVID-19 coughs in the Virufy datasets. The authors combined separable kernels in deep neural networks for COVID-19 detection.
Despotovic et al. [46] applied a CNN model based on VGGish to a Cough and Voice Analysis (CDCVA) dataset and obtained an improved accuracy of 88.52%, while Mohammed et al. [47] presented shallow machine learning, Convolutional Neural Network (CNN), and pre-trained CNN models on the Virufy and Coswara datasets, with performance metrics showing 77% accuracy. Brown et al. [48] presented ML algorithms such as Logistic Regression (LR), Gradient Boosting Trees, and Support Vector Machines for the detection of COVID-19.
Data augmentation techniques presented by previous researchers include the study by Lella and Pja [49], which applied traditional audio augmentation methods with a one-dimensional CNN for diagnosing the respiratory disease COVID-19 using human-generated sounds such as voice/speech, cough, and breath datasets. Salamon and Bello [27] examined the impact of different data augmentation methods on a CNN model. The authors concluded that class-conditional data augmentation is needed to improve the performance of deep learning models. Leng et al. [29] proposed a Latent Dirichlet Allocation (LDA) approach for the augmentation of audio events from audio recordings. The authors compared the performance of the proposed LDA algorithm to other data augmentation techniques such as time and pitch shifting and Gaussian noise. Based on this literature review, existing data augmentation and classification methods for COVID-19 detection from sound/audio datasets still struggle to identify an appropriate and lightweight data augmentation method that overcomes the problems of limited training data and data imbalance. Background noise in sound datasets hinders effective feature extraction; therefore, creating synthetic datasets from such noisy datasets would also affect the classification efficiency of deep learning models. There is a need to collect more quality data and thereby improve the performance of the learning models [38,50]. Therefore, this study proposes a simple and efficient deep learning architecture, referred to as the DeepShufNet model, for improved classification of COVID-19. In addition, we applied effective data augmentation techniques using noise and color transformation methods to generate better synthetic datasets, thus improving data generalization and COVID-19 detection.

Dataset
This experimental study was conducted using the publicly available Coswara dataset generated by Sharma et al. [51], which consists of nine different audio/sound samples collected from 2130 recordings. The audio/sound samples are breathing (two types: deep and shallow), cough (two types: heavy and shallow), digit counting (two types: fast and normal), and vowel phonation (three types: a, e, and o). The audio recordings from the Coswara dataset fall into seven categories: healthy (1372), positive_moderate (72), positive_mild (231), positive_asymp (42), recovered_full (99), respiratory_illness_not_identified (RINI) (150), and no_respiratory_illness_exposed (NRIE) (164). The summary of each category of audio samples for the entire Coswara dataset is shown in Figure 1, and Table 1 summarizes the classes selected for this study. In this study, our experiments focus mainly on the deep breathing audio samples (referred to as COCOA-DB). The architectural framework of our proposed model is presented in Figure 2. For this study, we merged some classes, as described in the next section. This merger is motivated by the similarity in the class names and the audio spectra; therefore, later in this study, the positive mild and positive moderate classes are merged and represented as a single COVID-19 positive class. The details of our proposed architecture are given in the remaining subsections, where its key blocks are discussed.

Data Pre-Processing
The audio recordings in the Coswara dataset have uneven durations. To determine the duration of each file, we used the expression $L = N(Y)/f(s)$, where N(Y) is the sample length and f(s) is the sampling frequency of each audio sample, which is 48 kHz. Based on this expression for L, the minimum and maximum lengths of the audio files are 4 s and 29 s, respectively. To ensure that all relevant features are captured during the analysis, we first applied a simple pre-processing and normalization method that scales the signal by its peak value so that the maximum amplitude is 1 [52]. Secondly, we applied a silent region deletion method, which eliminates the silent parts of the signal and keeps only the voiced portions. Previous studies have shown that silent region elimination improves system performance and reduces processing time.
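The pre-processing steps above can be sketched in a few lines of Python. This is a minimal illustration rather than the MATLAB pipeline used in the study: the function names are hypothetical, and the frame length and RMS threshold used for silence removal are assumptions, since the text does not state them.

```python
def duration_seconds(num_samples, sample_rate=48_000):
    """L = N(Y) / f(s): sample count divided by the sampling frequency."""
    return num_samples / sample_rate


def peak_normalize(signal):
    """Scale the signal so that its largest absolute amplitude equals 1."""
    peak = max((abs(x) for x in signal), default=0.0)
    if peak == 0.0:
        return list(signal)  # all-zero (or empty) input: nothing to scale
    return [x / peak for x in signal]


def remove_silence(signal, frame_len=480, threshold=0.02):
    """Keep only frames whose RMS amplitude reaches `threshold`.

    frame_len (10 ms at 48 kHz) and threshold are illustrative
    assumptions, not values taken from the paper.
    """
    kept = []
    for start in range(0, len(signal), frame_len):
        frame = signal[start:start + frame_len]
        rms = (sum(x * x for x in frame) / len(frame)) ** 0.5
        if rms >= threshold:
            kept.extend(frame)
    return kept
```

For example, a recording of 192,000 samples at 48 kHz gives `duration_seconds(192_000) == 4.0`, which corresponds to the shortest file length reported above.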

Feature Extraction
This study considered two categories of audio signal features, based on Mel spectrograms and Gammatone Frequency Cepstral Coefficient (GFCC) images, which are described below. Samples of the images generated for each class using the two feature extraction methods are depicted in Figures 3 and 4.

Mel Spectrogram
One of the most widely used time-frequency representations in sound classification is the Mel spectrogram [53]. This input representation has repeatedly shown its effectiveness when compared to other representations such as the short-time Fourier transform (STFT). We therefore transformed all our selected Coswara audio recordings into spectrograms using the default Mel spectrogram function in the MATLAB toolbox. The Mel spectrogram images were created with an FFT window and a frequency range of up to $2.0 \times 10^{4}$ Hz, with audio file lengths typically ranging from 10 s to 25 s. Samples of the Mel spectrograms created are depicted in Figure 3, showing the time-frequency spectrogram for each class category in the Coswara dataset. The power spectral density P(f, t), evaluated at the spaced times t and frequencies f, also differs between audio files, with higher power for healthy samples than for samples of the other classes.
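To make the Mel mapping concrete, the sketch below builds a triangular Mel filterbank and applies it to one power-spectrum frame. It is an illustration only: the study used MATLAB's default Mel spectrogram function, whose internal settings are not given in the text, so the Hz-to-Mel formula (the common 2595 log10(1 + f/700) variant), the number of bands, and the FFT size are all assumptions.

```python
import math


def hz_to_mel(f_hz):
    """Common Hz-to-Mel mapping (assumed variant)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_filterbank(num_bands, n_fft, sample_rate, f_min=0.0, f_max=None):
    """Return num_bands triangular filters over the n_fft // 2 + 1 FFT bins."""
    f_max = sample_rate / 2 if f_max is None else f_max
    n_bins = n_fft // 2 + 1
    # Band edges are equally spaced on the Mel scale, then mapped back to Hz.
    mel_lo, mel_hi = hz_to_mel(f_min), hz_to_mel(f_max)
    edges = [mel_to_hz(mel_lo + (mel_hi - mel_lo) * i / (num_bands + 1))
             for i in range(num_bands + 2)]
    bin_hz = [k * sample_rate / n_fft for k in range(n_bins)]
    bank = []
    for b in range(num_bands):
        lo, mid, hi = edges[b], edges[b + 1], edges[b + 2]
        row = []
        for f in bin_hz:
            rising = (f - lo) / (mid - lo)    # left slope of the triangle
            falling = (hi - f) / (hi - mid)   # right slope of the triangle
            row.append(max(0.0, min(rising, falling)))
        bank.append(row)
    return bank


def apply_filterbank(bank, power_spectrum):
    """Project one power-spectrum frame onto the Mel bands."""
    return [sum(w * p for w, p in zip(row, power_spectrum)) for row in bank]
```

Calling `mel_filterbank(8, 512, 48_000, f_max=20_000)` yields 8 filters over 257 FFT bins, with the upper edge at the 2.0 × 10^4 Hz limit mentioned above. Applying the filterbank to every STFT frame and taking the logarithm gives the image-like representation that is fed to the network.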

Gammatone Frequency Cepstral Coefficients (GFCC)
Gammatone Frequency Cepstral Coefficients (GFCC) were developed by Patterson et al. using Gammatone filter banks, which model the human auditory system as a set of overlapping band-pass filters [54]. In the GFCC feature extraction process, the speech signal is passed through the Gammatone filterbank in the frequency domain. The output of the Gammatone filterbank is used to obtain the cochleagram, which is a time-frequency representation of the signal. The impulse response of each gammatone filter can be expressed mathematically as in Equation (1).
In Equation (1), m is a constant (usually equal to 1) that controls the gain; the order of the filter is defined by the value of y, which is usually set to a value less than 4; the bandwidth is represented as n and is expressed in Equation (2); and ∅ is the phase, which is generally set to zero. Samples of the GFCC images created are depicted in Figure 4, showing the time-frequency spectrum for each class category in the Coswara dataset.
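As an illustration of the quantities named above, the sketch below evaluates the textbook gammatone impulse response, $g(t) = m\,t^{y-1} e^{-2\pi n t} \cos(2\pi f_c t + \varnothing)$, where $f_c$ is the center frequency of the filter. This standard form and the ERB-based bandwidth of Glasberg and Moore are assumptions about what Equations (1) and (2) contain; the paper's own equations are not reproduced in this text. The default order of 3 is chosen only to follow the statement that y is set to a value less than 4.

```python
import math


def erb_bandwidth(center_hz):
    """Bandwidth n from the Glasberg-Moore ERB scale (assumed form)."""
    erb = 24.7 * (4.37 * center_hz / 1000.0 + 1.0)
    return 1.019 * erb


def gammatone_impulse_response(center_hz, sample_rate, duration_s=0.025,
                               gain=1.0, order=3, phase=0.0):
    """Sample g(t) = m * t^(y-1) * exp(-2*pi*n*t) * cos(2*pi*fc*t + phase).

    gain is m, order is y, and the bandwidth n comes from erb_bandwidth.
    duration_s and the default order are illustrative assumptions.
    """
    n = erb_bandwidth(center_hz)
    num_samples = round(duration_s * sample_rate)
    response = []
    for k in range(num_samples):
        t = k / sample_rate
        envelope = gain * t ** (order - 1) * math.exp(-2.0 * math.pi * n * t)
        response.append(envelope * math.cos(2.0 * math.pi * center_hz * t + phase))
    return response
```

Convolving the audio signal with one such response per center frequency and stacking the band energies over time gives a cochleagram; the cepstral coefficients are then obtained from the compressed band energies.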

Data Augmentation Scheme
After the feature extraction steps, the data distribution remains heavily imbalanced among the seven classes, as shown in Figure 1. The majority class has, on average, 10 times more samples than the minority classes. This imbalance makes the classification task more difficult and thus influences the performance of our model. Data augmentation therefore not only provides more training samples and reduces overfitting during training, but also improves the accuracy and overall performance of the models [37]. In this study, we applied two categories of data augmentation methods to increase the number of training images of the minority classes depicted in Figure 1. The result is a new synthetic dataset referred to as COCOA (COswara-COvid-Augmented; Table 2), built with the following methods:
• Color transformation method: There are three popularly used color models in the literature; in this study, we adopted the rgb2lab and grayscale transformation methods. Grayscale images are also referred to as monochrome because they are made of 256 shades of grey and have a brightness value between 0 (black) and 1 (white). Several color transformation techniques were applied, namely brightness, contrast, rgb2gray, and rgb2lab. Horizontal flip, zoom, and shear transforms were also applied to each image in the dataset to generate a new dataset called COCOA-1.
• Noise addition: We applied Gaussian noise and salt-and-pepper noise with different parameters to each image in the dataset to generate a new synthetic dataset, called COCOA-2.
In addition to these two categories of data augmentation techniques, we also applied some traditional data augmentation methods, such as horizontal flip, vertical flip, and random reflection, to each image in the datasets. Table 3 summarizes the total data samples used in this study with the number of augmented samples per class. The numbers of synthetic samples generated by each transformation method from the training datasets are 1098, 760, and 760 for the all-positive COVID-19, positive asymptomatic, and recovered-full classes, respectively.
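The two augmentation categories described above can be sketched as follows, with images represented as nested lists of intensities in [0, 1]. This is a simplified illustration, not the MATLAB routines used in the study: the noise parameters (`sigma`, `density`) are placeholder assumptions because the text only says that different parameters were used, and the grayscale weights are the standard luminance weights used by MATLAB's rgb2gray.

```python
import random


def rgb_to_gray(pixel):
    """Luminance-weighted grayscale value of one (r, g, b) pixel in [0, 1]."""
    r, g, b = pixel
    return 0.2989 * r + 0.5870 * g + 0.1140 * b


def add_gaussian_noise(image, sigma=0.05, rng=None):
    """Add zero-mean Gaussian noise to every pixel and clip to [0, 1]."""
    if rng is None:
        rng = random.Random()
    return [[min(1.0, max(0.0, v + rng.gauss(0.0, sigma))) for v in row]
            for row in image]


def add_salt_pepper_noise(image, density=0.05, rng=None):
    """Set roughly `density` of the pixels to 0 (pepper) or 1 (salt)."""
    if rng is None:
        rng = random.Random()
    noisy = []
    for row in image:
        new_row = []
        for v in row:
            u = rng.random()
            if u < density / 2:
                new_row.append(0.0)   # pepper
            elif u < density:
                new_row.append(1.0)   # salt
            else:
                new_row.append(v)     # pixel left unchanged
        noisy.append(new_row)
    return noisy
```

Passing a seeded generator, for example `rng=random.Random(0)`, makes the synthetic images reproducible. Generating several noisy copies of each minority-class image with different parameter values is one way to rebalance the training set.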

Structure of The Proposed DeepShufNet Model
This study proposes the DeepShufNet model, a lightweight deep CNN model, as shown in Figure 5. Our choice of the pretrained ShuffleNet architecture is based on the concept of pointwise group convolution, which has been described in recent studies as a lightweight network that assigns models over two GPUs and uses repeated building blocks and channel shuffle. In addition, the use of pointwise group convolution and channel shuffling helps to minimize computational cost while still improving overall accuracy. The network was initially pre-trained on ImageNet. As audio recording is a one-dimensional time series, we train a one-dimensional convolutional neural network for binary classification. DeepShufNet consists of an input layer for a 224 × 224 × 3 image, multiple hidden layers, which include convolutional, batch normalization, pooling, flatten, and fully connected layers, and an output layer. The original size of each image is 875 × 656 pixels, so we used imresize to resize all images to 224 × 224, which is enough to identify all target ranges. In addition, based on the literature, a smaller input improves computational speed, reduces the number of parameters, and lowers the risk of overfitting. The proposed DeepShufNet used in our experiment has a total of 172 layers and 1.4 million learnable parameters. We applied 50% dropout to the hidden neurons, which helps to prevent overfitting. Despite the large number of layers, the DeepShufNet architecture uses operations such as grouped convolution, channel shuffle, and depth concatenation, which significantly reduce computational complexity and improve accuracy.
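The channel shuffle operation mentioned above can be illustrated with a short sketch. It shows the general ShuffleNet idea (view the channels as a groups × channels-per-group grid, transpose it, and flatten it again) and is not taken from the authors' implementation; the function name is hypothetical and each channel is represented by an arbitrary Python object.

```python
def channel_shuffle(channels, groups):
    """Interleave channels across groups, as in ShuffleNet.

    The channel list is treated as a (groups, per_group) grid, transposed
    to (per_group, groups), and flattened, so that the next grouped
    convolution receives channels from every previous group.
    """
    n = len(channels)
    if groups <= 0 or n % groups != 0:
        raise ValueError("number of channels must be divisible by groups")
    per_group = n // groups
    return [channels[g * per_group + i]
            for i in range(per_group)
            for g in range(groups)]
```

With six channels and two groups, `channel_shuffle([0, 1, 2, 3, 4, 5], 2)` returns `[0, 3, 1, 4, 2, 5]`. Without this step, information would never mix between groups, because each grouped convolution only sees the channels of its own group.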
In this study, we train with the Adam (adaptive moment estimation) optimizer and a minibatch size of 250 for searching and final training. The learning rate of our optimizer is subject to a warm start ranging from $1 \times 10^{-4}$ to 0.001, with a total of 50 epochs and an L2 regularization parameter of $2 \times 10^{-4}$. To ensure the optimal training of our model and to prevent overfitting, which is a major challenge for deep neural network models, we applied a dropout rate of 50%. The ShuffleNet architecture is made up of 172 layers and 1.4 million total learnable parameters, as summarized in Table 4.
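One possible reading of the warm start described above is a linear ramp of the learning rate from its initial to its final value. The sketch below shows that reading only; the linear shape and the number of warm-up steps are assumptions, since the text gives the two end points but not the schedule itself.

```python
def warmup_learning_rate(step, warmup_steps, lr_start=1e-4, lr_end=1e-3):
    """Linearly increase the learning rate from lr_start to lr_end.

    After `warmup_steps` steps the rate stays at lr_end. The linear ramp
    and the choice of warmup_steps are illustrative assumptions.
    """
    if warmup_steps <= 0 or step >= warmup_steps:
        return lr_end
    return lr_start + (lr_end - lr_start) * step / warmup_steps
```

Starting with a small rate limits large, noisy updates to the pre-trained weights during the first iterations of fine-tuning, which is the usual motivation for a warm start.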

Performance Evaluation
In this paper, we assess the performance of our proposed method using three datasets: the original Coswara dataset of deep breathing recordings, COCOA-1 (offline data augmentation based on time shift, pitch shift, noise), and COCOA-2 (using image augmentations). We evaluated the classification task using widely used evaluation metrics, namely accuracy, recall, precision, F1-score, and the confusion matrix. The mathematical expressions and descriptions of the performance metrics used in this study are given in Table 5.

Metrics | Description | Mathematical Expression
Specificity | Proportion of true negative (non-COVID-19) people against the actual number of people without the disease. | Specificity = TN / (TN + FP)
F1-Score | The weighted average of precision and recall. | F1-Score = (2 × Precision × Sen) / (Precision + Sen)
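The metrics used in this study can all be derived from the four confusion-matrix counts, as in the sketch below. It uses the standard definitions, with TP, FP, TN, and FN denoting true positives, false positives, true negatives, and false negatives; the function name and the returned dictionary layout are illustrative choices.

```python
def binary_metrics(tp, fp, tn, fn):
    """Compute accuracy, recall, specificity, precision, and F1-score."""
    def ratio(num, den):
        # Return NaN when a metric is undefined (empty denominator).
        return num / den if den else float("nan")

    accuracy = ratio(tp + tn, tp + fp + tn + fn)
    recall = ratio(tp, tp + fn)          # sensitivity
    specificity = ratio(tn, tn + fp)
    precision = ratio(tp, tp + fp)
    # F1 is the harmonic mean of precision and recall.
    f1 = ratio(2 * precision * recall, precision + recall)
    return {
        "accuracy": accuracy,
        "recall": recall,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
    }
```

As a consistency check against the reported results, a precision of 77.1% and a recall of 62.71% give 2 × 77.1 × 62.71 / (77.1 + 62.71) ≈ 69.2%, which matches the F1-score reported for the Mel COCOA-2 experiment. A metric comes out as NaN when its denominator is zero, for example precision when the model predicts no positive samples.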

Experimental Results and Discussion
This section presents extensive experiments investigating all the different datasets with the proposed DeepShufNet. All experiments were conducted in MATLAB R2020b on a desktop PC with an Intel(R) Core i5 (3.2 GHz) processor, 8 GB of RAM, and an NVIDIA GeForce GTX 1070 GPU server with 120 GB of memory.
Taking into consideration the hardware constraints and the issue of out-of-memory errors, we reduced the batch size to 200 for both training and testing. Considering the large data sparsity within the Coswara dataset classes, each experiment was repeated five times.

Training and Testing Prediction
The proposed DeepShufNet model was trained and tested on the feature-extracted images combined from all Coswara datasets. Cross-validation was applied to find the optimal parameter configuration, and the model was trained and validated on 80% of the total images extracted from the sound dataset. This portion consists of 1706 data samples comprising the healthy, positive asymptomatic, positive mild, positive moderate, recovered full, RINI, and NRIE classes with 1098, 34, 185, 58, 79, 120, and 132 samples, respectively. The Adam (adaptive moment estimation) optimizer was used as the training algorithm, with the hyperparameter values summarized in Table 6. The learning rate controls the rate of the weight updates, and therefore how quickly the prediction error is reduced, while the batch size determines the number of samples processed before the parameters of the network are updated. The baseline experiment was evaluated using the raw feature-extracted images, and training was performed with and without fine-tuning. The final DeepShufNet model was selected as the model with the lowest validation loss during training. The training of each experiment was analyzed, and improvements in the classification results in terms of validation accuracy and loss were noted. The results on the original dataset without augmentation suffer from a high misclassification rate of the minority class, especially when classifying the positive asymptomatic and positive COVID-19 classes, with recall and precision ranging from undefined (NA) to less than 10%. However, training the DeepShufNet model with our categories of synthetic datasets gave near-optimal results with better performance in the detection of COVID-19.
The experimental results are presented in four comparative categories, and all results were obtained from experiments on the test dataset. The overall performance of the model on each category of dataset is compared using the best model from the five recorded experiments. In each comparative experiment, the combination of accuracy, recall, and specificity is the main criterion for judging the performance of the model on each dataset category, since it examines both classes' outcomes and the improvement in the classification results for the minority class. The detailed summary of all measures for each category is given below.

Classification Deep Breath Sound (All Positive COVID-19 vs. Healthy)
This section compares the results of the transfer learning DeepShufNet on 224 × 224 pixel images for binary classification of healthy versus all positive classes. Due to the similarities between the positive mild and positive moderate classes, we combined these two classes to create a new class called the All-positive-Covid class. A comparison of the detection power of our proposed DeepShufNet on the Mel spectrogram feature images and the GFCC feature images is shown in Table 7. The classification results reflect some improvement and stability of DeepShufNet on the augmented datasets. On the test set, the best performance for DeepShufNet was achieved using the Mel spectrogram images in the COCOA-2 dataset (see Table 6), with positive COVID-19 detection summarized as a mean accuracy of 85.1 (standard deviation [SD], 4.23), 70.85 (SD, 7.7) for recall/sensitivity, 59.64 (SD, 13.12) for precision, 88.25 (SD, 6.14) for specificity, and 63.61 (SD, 6.7) for F1-score. The test set results of our proposed model on COCOA-3 also show a substantial improvement, with a mean accuracy of 87.82 (SD, 1.3), 69.49 (SD, 4.9) for recall/sensitivity, 64.82 (SD, 4.7) for precision, 91.75 (SD, 1.9) for specificity, and 66.9 (SD, 2.8) for F1-score. The original dataset without augmentation therefore performs the worst on the test set when compared with the other datasets. The datasets using the GFCC images with augmentation still outperform the original dataset, with mean accuracies of 83.1 (SD, 1.4), 83.05 (SD, 0.9), 76.4 (SD, 2.5), and 74.9 (SD, 3.8) for COCOA-3, COCOA-2, COCOA-1, and raw data (no augmentation), respectively. Also of interest is the mean recall for DeepShufNet, which is 71.33 (SD, 2.2), 48.7 (SD, 14.1), 46.7 (SD, 11.5), and 38.8 (SD, 9.3) for COCOA-1, no augmentation, COCOA-2, and COCOA-3, respectively.
The summary of DeepShufNet on the Mel spectrogram images is presented in Figure 6, which reflects the best experimental outcome for COCOA-2, with values for accuracy, recall, specificity, precision, and F1-score of 90.1%, 62.71%, 95.99%, 77.1%, and 69.2%, respectively. The second-best results were achieved with COCOA-3, with an accuracy of 89.5%, 71.2% recall, 93.4% specificity, 70% precision, and 70.6% F1-score. The worst result was achieved by the raw dataset without augmentation, with an accuracy of 79%, 54.23% recall, 84.3% specificity, 42.67% precision, and 47.76% F1-score. In the same manner, Figure 7 shows the comparison results of DeepShufNet for GFCC images. The noise augmentation dataset (COCOA-2) and the combined dataset (COCOA-3) show 84.1% and 84.7% accuracy, respectively. The two best recall results were achieved by COCOA-1 and COCOA-2, which indicates that the data augmentation approach helps to improve classification results.
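The mean (SD) values reported in this subsection summarize repeated runs of the same experiment. A minimal sketch of that aggregation is given below; whether the study used the sample or the population standard deviation is not stated, so the use of the sample SD here is an assumption, and the function name is hypothetical.

```python
import statistics


def summarize_runs(values):
    """Return (mean, sample standard deviation) of one metric over runs."""
    mean = statistics.mean(values)
    # stdev() needs at least two values; a single run has no spread.
    sd = statistics.stdev(values) if len(values) > 1 else 0.0
    return mean, sd
```

Reporting the spread alongside the mean matters here because the minority classes are small, so the recall and precision of individual runs can vary widely.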

Experimental Results: Positive Asymptomatic vs. Healthy
To indicate the contribution of the proposed DeepShufNet model, a second experiment was conducted to classify healthy versus positive asymptomatic samples alone. The wide margin in data sparsity between these two classes could result in serious overfitting of the model. For the raw dataset, the performance metrics for both Mel spectrogram and GFCC images did not grow continuously, whereas applying data augmentation to the training data reduced overfitting, with a training accuracy much lower than the testing accuracy in the last epoch. In summary, the experimental results indicate that training with augmented datasets did not have a significant influence on classification accuracy; however, training the model with COCOA-1 showed good classification performance on the test sets in terms of accuracy, but the second-worst recall rate. On the other hand, training DeepShufNet with COCOA-2 slightly increased the test classification accuracy, specificity, and F1-score. Considering the efficiency of the data augmentation methods, classification using noise augmentation is more suitable for practical application when the dataset is small, as reflected in Table 7. Figures 8 and 9 show that augmentation of the Mel spectrogram images improved the recall rate, precision, and F1-score. These results suggest that data augmentation improved the classification results of the proposed DeepShufNet model for both feature extraction images. The experimental results in Table 8 show an improvement from data augmentation over the baseline experiment, with the best accuracy achieved by COCOA-1 at 97.15% (SD, 0.5), followed by 95.8% (SD, 1.1) for COCOA-2, 92.7% (SD, 0.17) for COCOA-3, and 92.2% (SD, 0.9) for the data without augmentation.

Experimental Results: Healthy vs. Recovered-Full
In this experiment, we validated the effectiveness of the proposed model by analyzing the detection rate of DeepShufNet in classifying healthy against fully recovered subjects. The model was applied to four datasets based on MFCC feature-extracted images, namely raw data (no augmentation), COCOA-1, COCOA-2, and COCOA-3, which gave the following accuracy results: 93.45 (SD, 0.41) for COCOA-2; 93.33 (SD, 0.51) for COCOA-1; 91.68 (SD, 4.0) for COCOA-3; and 91.03 (SD, 0.8) for no augmentation (see Table 9). Figures 10 and 11 show the best results of all four datasets on the DeepShufNet model and indicate that the combination of the two data augmentation techniques (COCOA-3) gave the best results.
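The results in these experiments are reported as a mean with its standard deviation across repeated evaluations. A minimal Python sketch of that summary is given below. The helper name `summarize`, the number of runs, and the run values are hypothetical, and the use of the sample standard deviation (n - 1 denominator) is an assumption, since the text does not state which estimator was used.

```python
import statistics


def summarize(values, ndigits=2):
    """Format repeated-run scores as 'mean (SD, sd)'."""
    mean = statistics.mean(values)
    # Sample standard deviation; statistics.pstdev would give the
    # population version if that is what a study reports.
    sd = statistics.stdev(values)
    return f"{mean:.{ndigits}f} (SD, {sd:.{ndigits}f})"


# Hypothetical test accuracies from five runs (not from the paper).
runs = [84.0, 86.0, 88.0, 90.0, 92.0]
print(summarize(runs))  # prints "88.00 (SD, 3.16)"
```

A small SD, as in several of the augmented settings above, indicates that the result is stable across runs, while a large SD signals sensitivity to the particular split or initialization.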

Limitations
One of the major issues faced in this study is the problem of misclassification errors associated with the poor generalization of some noisy images. As expected, the majority of the misclassification errors can be attributed to a serious imbalance of classes and limited data samples. The classes of sound, when represented as either Mel spectrogram images or GFCC feature images, have almost similar power representations, and this could impact the ability of the model to generalize the data efficiently. The generated spectrogram for each audio file is a two-dimensional array of intensity values that is largely noisy because of environmental noises connected to the audio signals [55]. Therefore, it is important to equalize the distribution of values to enhance feature learning.
The proposed model is designed based on existing data augmentation techniques (color transformation and noise) and features in the frequency domain, which makes the model simple and intuitive with low space cost. However, image spectra of sound signals can be complex, since some of the images cannot fully reflect the characteristic information of the sound signals, although frequency-domain features have been used by previous researchers in sound classification tasks.
Regardless of these limitations, the proposed DeepShufNet model has proven to be effective in the detection of COVID-19, despite the gross imbalance in classes and the limited dataset. Moreover, it has low computational complexity in terms of resources and time. In the future, more complex data augmentation methods should be explored to reduce misclassification errors by generating a cleaner dataset for proper generalization.
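The two augmentation families named in this section, color transformation and noise addition, can be sketched on a spectrogram image as follows. This is a dependency-free illustration on nested lists, not the authors' implementation: the grayscale conversion stands in for an unspecified color transformation, and the function names, `sigma`, and `seed` are all assumptions. In practice an array library would normally replace the pixel loops.

```python
import random


def add_gaussian_noise(image, sigma=10.0, seed=0):
    """Return a noisy copy of an 8-bit RGB image.

    `image` is a list of rows, each row a list of (r, g, b) tuples
    with values in 0..255. `sigma` and `seed` are illustrative.
    """
    rng = random.Random(seed)

    def clip(value):
        # Keep every channel inside the valid 8-bit range.
        return max(0, min(255, int(round(value))))

    return [
        [tuple(clip(c + rng.gauss(0.0, sigma)) for c in pixel) for pixel in row]
        for row in image
    ]


def to_grayscale(image):
    """One possible color transformation: BT.601 luma on all channels."""
    result = []
    for row in image:
        new_row = []
        for r, g, b in row:
            y = int(round(0.299 * r + 0.587 * g + 0.114 * b))
            new_row.append((y, y, y))
        result.append(new_row)
    return result


# Tiny hypothetical 2 x 2 "spectrogram image" for demonstration.
original = [[(10, 20, 30), (200, 100, 50)], [(0, 0, 0), (255, 255, 255)]]
color_variant = to_grayscale(original)
noise_variant = add_gaussian_noise(original)
combined_variant = add_gaussian_noise(to_grayscale(original))
```

Each variant keeps the label of its source image, so applying the transformations to minority-class images is one way to enlarge that class in the training set while leaving the test set untouched.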

Comparison to Related Work on COVID-Sound Databases and Discussion
A further comparison in terms of accuracy, recall, and precision was carried out between our proposed system and other existing systems built on COVID-19 sound databases. Despite the different experimental conditions applied to each classification task, the proposed DeepShufNet model shows improved and promising results for COVID-19 detection compared to the existing studies. The comparison with related work is summarized in Table 10.

Conclusion
The increasing popularity of deep neural network models in sound classification tasks is impressive. However, although there has been some research on COVID-19 detection based on different CNN architectures, some of the publicly available datasets still suffer from huge data imbalance and limited size, and some of the machine learning models still classify poorly. Therefore, this work applies a deep learning model, called DeepShufNet, to datasets produced by different categories of data augmentation techniques. The main contributions of this work include:

1. Covering the gap between limited datasets and class imbalance by creating a larger corpus of synthetic datasets using simple and effective data augmentation techniques. Three different synthetic datasets were created, namely COCOA-1, COCOA-2, and COCOA-3.

2. A deep learning model based on the pre-trained ShuffleNet architecture, called DeepShufNet, was trained and evaluated on the analyzed datasets for comparison. The experimental analysis of the augmented datasets in comparison with baseline results showed significant improvement in performance metrics, better data generalization, and better test results.
We compared and analyzed the effects of the two feature extraction methods, namely Mel spectrogram and GFCC imaging, on the DeepShufNet model. This study investigated the effects of augmented images in the detection of COVID-19, including positive asymptomatic cases and fully recovered cases. The results showed that the DeepShufNet model had the highest accuracy on COCOA-2 Mel spectrogram images for almost all the comparison cases. The proposed DeepShufNet models showed an improved performance, especially in recall, precision, and F1-score, for all three types of augmented images. The proposed model showed the highest test results, with accuracy, precision, recall, specificity, and F1-score of 90.1%, 77.1%, 62.7%, 95.98%, and 69.1%, respectively, for positive COVID-19 detection using the Mel COCOA-2 training datasets. In the same manner, the experiment on the detection of positive asymptomatic cases achieved a best recall rate of 62.5%, a specificity rate of 97.1%, and a 48% F1-score.
In the future, we will explore advanced data augmentation techniques such as the application of generative adversarial networks (GANs) to train and test the model. Furthermore, more deep learning architectures will be implemented to improve and enhance COVID-19 recognition performance. In addition, the proposed DeepShufNet deep learning model could also be applied and evaluated with the combination of all the different sound datasets.