Automated COVID-19 and Heart Failure Detection Using DNA Pattern Technique with Cough Sounds

COVID-19 and heart failure (HF) are common disorders, and although they share some symptoms, they require different treatments. Accurate diagnosis of these disorders is crucial for disease management, including patient isolation to curb the spread of COVID-19. In this work, we aim to develop a computer-aided diagnostic system that can accurately differentiate three classes (normal, COVID-19 and HF) using cough sounds. A novel handcrafted model based on deoxyribonucleic acid (DNA) patterns was used to automatically classify COVID-19 vs. healthy (Case 1), HF vs. healthy (Case 2) and COVID-19 vs. HF vs. healthy (Case 3). The model was developed using cough sounds collected with mobile phones from 241 COVID-19 patients, 244 HF patients, and 247 healthy subjects. To the best of our knowledge, this is the first work to automatically classify healthy subjects, HF and COVID-19 patients using cough sound signals. Our proposed model comprises a graph-based local feature generator (DNA pattern), an iterative maximum relevance minimum redundancy (ImRMR) feature selector, and classification using the k-nearest neighbor (kNN) classifier. The model attained an accuracy of 99.38%, 100.0%, and 99.49% for Case 1, Case 2, and Case 3, respectively. The developed system is completely automated and economical, and can be utilized to accurately distinguish COVID-19 from HF using cough sounds.


Introduction
The COVID-19 pandemic is continuing to the present time despite recent vaccination efforts. Experts advise people to continue to wear masks, implement sanitization procedures, and avoid crowds [1,2]. Curfews still exist in many countries. COVID-19 has disrupted normal life and has strained national health resources, even more so at the beginning of the pandemic [3]. A new normal is necessary to limit its spread [4] and people are often living in isolation according to quarantine rules [5,6]. Many patients with

• New local feature generator based on graph theory and the chemical structure of the nucleotide basic units of the DNA molecule, which we labelled the DNA pattern.
• New prospectively acquired dataset comprising cough sounds recorded from healthy subjects, COVID-19 patients, and HF patients using basic smartphone microphones, which we divided into standardized one-second sound segments for analysis.
• To the best of our knowledge, this is the first work to automatically classify healthy subjects, HF and COVID-19 patients using cough sound signals.
• The DNA pattern- and ImRMR-based model combined with the standard kNN classifier attained excellent results, with greater than 99% accuracy for every Case.

[...] tomographic (CT) chest images [31], and attained a 96.25% accuracy rate for discriminating between COVID-19 (+) and COVID-19 (−) status. Singh et al. [32] applied a CNN model to CT chest images and attained a 93.50% accuracy rate for binary classification of images into infected (+) vs. infected (−). Horry et al. [33] used a transfer learning-based method that analyzed X-ray, CT, and ultrasound images from four different datasets (Covid-19 image data collection [34], NIH chest X-Ray [35], Covid-CT [36], and POCOVID [37]) and, for each imaging modality, calculated the performance metrics of analysis models that included VGG16 [38], VGG19 [38], Xception [39], InceptionResNetV2 [40], InceptionV3 [41], NASNetLarge [42], DenseNet121 [43], and ResNet50V2 [44]. For instance, the F1-score values for VGG19 were 87.00%, 99.00%, and 78.00% for X-ray, ultrasound, and CT, respectively. Zebin and Rezvy [45] applied a CNN method to chest X-ray images for initial COVID-19 classification into COVID-19, normal and pneumonia classes, as well as for monitoring of disease progression. They reported 90.00%, 96.80%, and 94.30% accuracy rates for the VGG-16, EfficientNetB0 [46] and ResNet50 models, respectively.

Material
Using various mobile phones, cough sounds were recorded from 247 healthy subjects as well as 241 COVID-19 and 244 HF patients who attended Firat University Hospital, and stored in m4a (719), mp3 (3) or ogg (10) formats. Ethical approval for the study was obtained from the Firat University Ethics Committee. Because the recordings were of different durations, they were subdivided into standardized one-second sound segments for analysis. Of a total of 2156 segments, 696 (32%), 906 (42%) and 554 (26%) came from healthy subjects, COVID-19 patients and HF patients, respectively.
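The subdivision into one-second segments can be illustrated with a minimal sketch. It assumes non-overlapping segments and discards any trailing remainder shorter than one second; the text does not state either choice, and the function name and the 8 kHz sampling rate in the usage lines are hypothetical.

```python
def split_into_segments(samples, sample_rate, segment_seconds=1.0):
    """Split a recording into consecutive fixed-length segments.

    Assumption: segments do not overlap, and a trailing remainder shorter
    than one segment is dropped (the source does not specify either choice).
    """
    segment_length = int(round(sample_rate * segment_seconds))
    if segment_length <= 0:
        raise ValueError("segment length must be positive")
    n_segments = len(samples) // segment_length
    return [
        samples[i * segment_length:(i + 1) * segment_length]
        for i in range(n_segments)
    ]


# Usage: a 2.5-second recording at a hypothetical 8 kHz sampling rate
recording = [0.0] * 20000
segments = split_into_segments(recording, sample_rate=8000)
```

With these inputs the 2.5-second recording yields two full segments, and the final half second is dropped.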

Method
The model comprised a graph-based local feature generator, an iterative feature selector, and a classification component. The feature generator used graphical depictions of the chemical structures of the nucleotide basic units of the DNA molecule, purine and pyrimidine, to generate features from cough sounds. The optimal number of features was selected using ImRMR, and the chosen features were classified using a standard kNN classifier. A schematic of this model is shown in Figure 1. The pseudocode of the model is given in Algorithm 1.

DNA Pattern
A new DNA pattern-based local feature generator was proposed. Several graph-based feature extraction models have been reported in the literature [18,47], and molecular structure graphs used in deep learning models and graph networks have attained high classification performance [48,49]. In this study, we used the aromatic heterocyclic chemical structures of the nucleotide base units of the DNA molecule (purine, with its fused six- and five-membered ring conformation, and pyrimidine, with its six-membered ring) to generate features from cough sound signal segments. Each purine nucleotide unit (adenine, guanine) on one DNA strand is hydrogen-bonded to the corresponding pyrimidine nucleotide unit (thymine, cytosine) of the second DNA strand (base pairing) to collectively form the DNA double helix, which is the basis of our genetic code. The chemical structures of purines and pyrimidines are topologically distinctive and can be represented as directed cyclic graphs (Figure 2). These graphs are utilized as the pattern of a histogram-based local feature generator. As can be seen in Figure 2, there are 25 edges in these two graphs, and each edge supplies the parameters for one generated binary feature.
Figure 2. Directed cyclic graphical representations of purine (fused six- and five-membered ring conformation) and pyrimidine (six-membered ring). Individual directed paths are constructed using red arrows, which are enumerated. The initial and final points of each arrow represent the first and second parameters of the signum function for bit generation, respectively. With both structures combined, 25 bits (total number of directed paths) can be generated using 5 × 7 and 6 × 5 sized matrices (see text).
A schematic of the proposed DNA pattern-based feature generation is shown in Figure 3.
Steps of the proposed DNA pattern-based feature generation:
Step 1: Divide the cough sound into overlapping blocks of size 35.
Step 2: Create the first matrix, of size 5 × 7, using vector-to-matrix transformation.
Step 3: Use the purine pattern and the signum function to generate 14 bits. The signum function is defined in Equation (1), where γ(., .) denotes the signum function, and f and s are its first and second parameters, respectively.
Step 4: Divide the cough sound into overlapping blocks of size 30.
Step 5: Create a second matrix, of size 6 × 5, using vector-to-matrix transformation.
Step 6: Use the pyrimidine pattern and the signum function to generate 11 bits.
Step 7: Merge the generated bits (25 bits in total) from Steps 3 and 6.
Step 8: Divide these bits into left, middle and right groups.
Step 9: Create three map signals, m1, m2 and m3, using the generated bit groups. These are the first, second, and third map signals for feature generation, and their histograms are extracted to obtain feature vectors. From Equations (5)-(9), these signals are coded with 8, 9, and 8 bits, respectively.
Step 10: Extract the histograms of m1, m2, and m3. The lengths of these histograms are 2^8, 2^9, and 2^8, respectively.
Step 11: Merge the extracted histograms to obtain the feature vector of the DNA pattern.
Diagnostics 2021, 11, 1962
The eleven steps above define the DNA pattern-based feature generation. 1024 features are generated from each sound segment by deploying these steps.
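The eleven steps can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the authors' implementation: the edge lists below are hypothetical placeholders (the actual directed paths are those drawn in Figure 2), the signum function is assumed to return 1 when the first parameter is greater than or equal to the second, the blocks are assumed to advance one sample at a time, and the vector-to-matrix transformation is assumed to be row-major.

```python
import math

# Hypothetical node paths as (row, column) positions; consecutive pairs form
# the directed edges. The real paths are those shown in Figure 2.
PURINE_PATH = [(1, 0), (0, 1), (0, 3), (1, 4), (3, 4), (4, 3), (4, 1), (3, 0),
               (1, 0), (1, 4), (0, 5), (2, 6), (4, 5), (3, 4), (2, 2)]
PYRIMIDINE_PATH = [(0, 2), (1, 4), (4, 4), (5, 2), (4, 0), (1, 0),
                   (0, 2), (5, 2), (1, 0), (1, 4), (4, 0), (4, 4)]


def edges(path):
    """Turn a node path into a list of directed (start, end) edges."""
    return list(zip(path, path[1:]))


def signum(first, second):
    """Assumed form of Equation (1): 1 if first - second >= 0, else 0."""
    return 1 if first - second >= 0 else 0


def pattern_bits(block, n_cols, path):
    """Read the block as a row-major matrix and emit one bit per edge."""
    return [
        signum(block[r1 * n_cols + c1], block[r2 * n_cols + c2])
        for (r1, c1), (r2, c2) in edges(path)
    ]


def to_int(bits):
    """Binary-to-decimal conversion, most significant bit first."""
    value = 0
    for bit in bits:
        value = (value << 1) | bit
    return value


def dna_pattern_features(signal):
    """Return the 1024-element histogram feature vector (Steps 1-11)."""
    hist1, hist2, hist3 = [0] * 256, [0] * 512, [0] * 256
    for start in range(len(signal) - 35 + 1):  # stride of 1 is an assumption
        bits = pattern_bits(signal[start:start + 35], 7, PURINE_PATH)       # 14 bits
        bits += pattern_bits(signal[start:start + 30], 5, PYRIMIDINE_PATH)  # 11 bits
        hist1[to_int(bits[:8])] += 1     # left group, 8 bits
        hist2[to_int(bits[8:17])] += 1   # middle group, 9 bits
        hist3[to_int(bits[17:])] += 1    # right group, 8 bits
    return hist1 + hist2 + hist3


# Usage on a synthetic signal (not study data)
signal = [math.sin(0.1 * i) for i in range(200)]
features = dna_pattern_features(signal)
```

The vector length follows directly from the bit grouping: 2^8 + 2^9 + 2^8 = 256 + 512 + 256 = 1024.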

Feature Selection
For automatic selection of the optimal number of generated features, we proposed an iterative version of the maximum relevance minimum redundancy (mRMR) selector [50], ImRMR, which incorporates an error calculator based on the kNN classifier. A schematic of the ImRMR selector is shown in Figure 4.
Figure 4. Steps involved in the selection of an optimal number of features using the ImRMR selector.
By deploying ImRMR, each of the 1024 features extracted by the DNA pattern is selected iteratively, and the kNN classifier is employed to calculate the resultant error rates of the selected feature vectors. The steps of the ImRMR are detailed below.
Step 2: Select features using the id that has been calculated in Step 1. Here, sf_i represents the ith selected feature vector, and k is the number of observations. This step describes the iterative feature selection.
Step 3: Calculate loss values of each feature vector selected using the kNN classifier with 10-fold cross-validation.
In Equation (12), µ and kNN(.) represent the error value and the kNN classifier, respectively.
Step 4: Find the minimum loss value.
Step 5: Select optimal feature vector (last) using index (ind) of the minimum error value.
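The iterative part of the selector (Steps 2 to 5) can be sketched as follows. The sketch assumes that a ranked list of feature indices is already available from mRMR (`ranked_ids` is a hypothetical name), evaluates every prefix of that ranking, and uses a 1-nearest-neighbor loss with squared Euclidean distance and interleaved folds for brevity; the Spearman distance and the exact range of feature counts used in the study are not reproduced here.

```python
def sq_dist(a, b):
    """Squared Euclidean distance (a stand-in for the Spearman distance)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def knn_cv_loss(rows, labels, n_folds=10):
    """Misclassification rate of 1-NN under k-fold cross-validation."""
    n = len(rows)
    errors = 0
    for fold in range(n_folds):
        test_idx = list(range(fold, n, n_folds))  # interleaved folds (assumption)
        held_out = set(test_idx)
        train_idx = [i for i in range(n) if i not in held_out]
        for i in test_idx:
            nearest = min(train_idx, key=lambda j: sq_dist(rows[i], rows[j]))
            errors += labels[nearest] != labels[i]
    return errors / n


def iterative_selection(rows, labels, ranked_ids, loss_fn=knn_cv_loss):
    """Evaluate each prefix of the ranking and keep the one with least loss."""
    losses = []
    for count in range(1, len(ranked_ids) + 1):
        chosen = ranked_ids[:count]
        subset = [[row[j] for j in chosen] for row in rows]
        losses.append(loss_fn(subset, labels))
    best = min(range(len(losses)), key=losses.__getitem__)  # first minimum
    return ranked_ids[:best + 1], losses


# Usage on toy data (not study data): feature 0 separates the two classes
labels = [i % 2 for i in range(20)]
rows = [[labels[i] * 10 + (i % 3) * 0.1, (i * 7) % 5, (i * 3) % 4]
        for i in range(20)]
selected, losses = iterative_selection(rows, labels, ranked_ids=[0, 2, 1])
```

Because ties are resolved in favor of the first minimum, the sketch returns the smallest feature subset that reaches the lowest loss.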

Classification
A standard distance-based classifier (kNN) [19] was used both to select the optimal number of features (it functioned as the error value generator, see Section 2.2.2) and to calculate the classification results. The kNN parameters were: k, one; distance metric, Spearman; distance weighting, equal; and standardization, enabled. Ten-fold cross-validation was chosen as the validation technique.
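A minimal sketch of a 1-nearest-neighbor classifier with the Spearman distance is shown below. It takes the Spearman distance between two observations to be one minus the Spearman rank correlation of their feature values, with tied values receiving average ranks. Feature standardization is omitted for brevity, the handling of constant vectors is an assumption, and the function names are illustrative.

```python
import math


def average_ranks(values):
    """Rank values from 1 upward, giving tied values their average rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_distance(a, b):
    """One minus the Spearman rank correlation between two observations."""
    ra, rb = average_ranks(a), average_ranks(b)
    mean_a, mean_b = sum(ra) / len(ra), sum(rb) / len(rb)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(ra, rb))
    var_a = sum((x - mean_a) ** 2 for x in ra)
    var_b = sum((y - mean_b) ** 2 for y in rb)
    if var_a == 0 or var_b == 0:
        return 1.0  # correlation undefined; treated as uncorrelated (assumption)
    return 1.0 - cov / math.sqrt(var_a * var_b)


def predict_1nn(train_rows, train_labels, query):
    """Return the label of the training observation closest to the query."""
    nearest = min(
        range(len(train_rows)),
        key=lambda i: spearman_distance(train_rows[i], query),
    )
    return train_labels[nearest]


# Usage on toy vectors (not study data)
train_rows = [[1, 2, 3], [3, 2, 1]]
train_labels = ["a", "b"]
prediction = predict_1nn(train_rows, train_labels, [2, 5, 9])
```

Because the distance depends only on the rank ordering of feature values within each observation, two feature vectors with the same ordering are at distance zero regardless of scale.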

Experimental Setup
The proposed DNA pattern- and ImRMR-based cough sound classification model was developed in the MATLAB (2020b) coding environment. The computer used had the following configuration: operating system, Windows 10.1 Professional; RAM, 48 gigabytes; CPU, Intel i9 9900 with a 3.60 GHz clock frequency. Neither a graphics processor nor parallel processing was used to develop the model.

Cases
To evaluate the proposed model comprehensively, three distinct clinically relevant classification problems were defined based on the collected cough sound dataset:
Case 1: COVID-19 vs. healthy.
Case 2: HF vs. healthy.
Case 3: COVID-19 vs. HF vs. healthy.

Results
Standard performance metrics, including accuracy, sensitivity, precision, F1-score, and geometric mean [51], were evaluated (see Table 1), and confusion matrices were constructed (Figure 5) for all Cases. High classification accuracy rates of 99.38%, 100% and 99.49% were attained for Case 1, Case 2 and Case 3, respectively, with low rates of classification error.
The time burden (computational complexity) of the presented model was expressed using big O notation. The time complexity of the DNA pattern-based local feature generator was O(n), where n was the length of the cough sound segment analyzed. ImRMR used both kNN and mRMR, and constituted the most complex phase of the model. Its time burden was expressed in terms of the iteration number, the length of the features, and the number of observations, and included a squared term. In the classification phase, kNN was used, and its time complexity was likewise expressed in big O notation.
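As an illustration of how such metrics follow from a confusion matrix, the sketch below computes them for an arbitrary number of classes. The macro-averaging of per-class values and the definition of the geometric mean as the geometric mean of per-class sensitivities are assumptions, and the matrix in the usage lines is invented for illustration rather than taken from Figure 5.

```python
import math


def classification_metrics(cm):
    """Compute metrics from a confusion matrix (rows: true, columns: predicted)."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    recalls, precisions, f1s = [], [], []
    for i in range(n):
        tp = cm[i][i]
        actual = sum(cm[i])                          # samples truly in class i
        predicted = sum(cm[r][i] for r in range(n))  # samples assigned to class i
        recall = tp / actual if actual else 0.0
        precision = tp / predicted if predicted else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        recalls.append(recall)
        precisions.append(precision)
        f1s.append(f1)
    return {
        "accuracy": sum(cm[i][i] for i in range(n)) / total,
        "sensitivity": sum(recalls) / n,     # macro-averaged (assumption)
        "precision": sum(precisions) / n,    # macro-averaged (assumption)
        "f1": sum(f1s) / n,                  # macro-averaged (assumption)
        "geometric_mean": math.prod(recalls) ** (1 / n),
    }


# Usage with a made-up two-class confusion matrix
metrics = classification_metrics([[50, 0], [1, 49]])
```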

Discussion
Cough sound-based COVID-19 detection is an emerging field of research for both clinicians and machine learning experts. The prevalence and incidence of HF were increasing even before the onset of the COVID-19 pandemic, and HF is now often affected by a lack of access to routine medical care. The clinical presentations of COVID-19 and HF can overlap, which underscores the need for computer-aided diagnostic tools to support clinicians in triage and management. Both conditions can induce cough symptoms. Therefore, we collected cough sounds from COVID-19 and HF patients, as well as healthy subjects, to test the performance of our proposed DNA pattern- and ImRMR-based model. The model addresses three clinically relevant classification problems: COVID-19 vs. healthy; HF vs. healthy; and COVID-19 vs. HF vs. healthy. It generates 1024 features from each one-second cough sound segment, and an iterative feature selector is employed to select the most discriminative features. We present the results obtained using the ImRMR, iterative neighborhood component analysis (INCA), iterative ReliefF (IRF) and iterative Chi2 (IChi2) feature selectors. The plots of error rates versus number of features selected using these feature selectors for Case 3 are shown in Figure 6. It can be noted from Figure 6 that the numbers of selected features corresponding to the lowest error rates for Case 3 classification using IChi2, INCA, IRF and ImRMR are 226, 802, 701, and 895, respectively. The minimum error rate is 0.0051 for ImRMR and 0.006 for the IChi2, INCA, and IRF selectors. Application of ImRMR to Case 1 and Case 2 yielded minimum error rates of 0.0062 and 0 for 198 and 50 selected features, respectively (Figure 7). Overall, the model attained 99.38%, 100% and 99.49% accuracy rates for Case 1, Case 2 and Case 3 classifications, respectively.
The standard kNN classifier is employed to calculate the error rate during the feature selection phase (see Section 2.2.3) and to obtain the classification results. In addition to kNN [19], we tested decision tree (DT) [52], linear discriminant (LD) [53], naïve Bayes (NB) [54], support vector machine (SVM) [55], bagged tree (BT) [56], and subspace discriminant (SD) [57] classifiers on the classification tasks using all 1024 features. It can be noted from Figure 8 that the best results are obtained using the kNN classifier. Therefore, kNN is selected both as the classifier and as the error/loss value generator in the feature selection phase. The performance parameters (%) obtained for automated COVID-19 detection using cough sound signals are listed in Table 2.
The benefits and limitations of our proposed DNA pattern-based method are given below.
The benefits are as follows.
• Developed a new cough sound dataset, which was collected from healthy subjects, and COVID-19 and HF patients.
• Presented a novel histogram-based feature generator inspired by DNA patterns. To the best of our knowledge, this is the first work to automatically classify healthy subjects, HF and COVID-19 patients using cough sound signals.
• Proposed a DNA pattern- and ImRMR-based model which attained greater than 99% accuracy for all (binary and multiclass) defined classification problems.
• Generated an automated model based on cough sounds that is accurate, economical, rapid, and computationally lightweight.
The limitations of this work are given below:
• The system should be validated with a larger dataset prior to clinical application.
• Only a three-class system was used (normal, COVID-19 and HF).
We have presented a histogram-based hand-modeled feature generation function using the DNA molecular pattern. New-generation deep learning models based on molecular shapes can be further studied to improve model performance. A snapshot of cloud-based cough detection via mobile application with cough sounds is presented in Figure 9.

Conclusions
This paper presents a new automated COVID-19 and HF detection model using cough sounds. The model extracts subtle features from a cough sound signal using a histogram-based feature generator built on the chemical structure of the DNA molecule. The proposed DNA patterns used for feature bit generation, combined with ImRMR and the kNN classifier, yielded accuracies of 99.38%, 100%, and 99.49% for COVID-19 vs. healthy, HF vs. healthy, and COVID-19 vs. HF vs. healthy diagnoses, respectively. The model is accurate, economical and computationally lightweight. In the future, we intend to detect asthma in addition to the three classes currently used for cough sound signal analysis.

Informed Consent Statement: Informed consent was obtained from all subjects involved in the study.

Data Availability Statement:
The data presented in this study are available on request from the corresponding author. The data are not publicly available due to restrictions imposed by the institutional ethics committee.