An Automated System for Classification of Chronic Obstructive Pulmonary Disease and Pneumonia Patients Using Lung Sound Analysis

Chronic obstructive pulmonary disease (COPD) and pneumonia are two fatal lung diseases that share common adventitious lung sounds. Diagnosing these diseases from lung sound analysis, with the aim of a noninvasive technique for telemedicine, is a challenging task. A novel framework is presented to diagnose COPD and pneumonia via an application of signal processing and machine learning. This model will help the pulmonologist to accurately detect COPD and pneumonia. COPD, normal, and pneumonia lung sound (LS) data from the ICBHI respiratory database are used in this research. The performance analysis evidences the improved performance of the quadratic discriminant classifier, with an accuracy of 99.70% on selected fused features after experimentation. A fusion of time-domain, cepstral, and spectral features is employed. Feature selection for fusion is performed through the backward-elimination method, whereas empirical mode decomposition (EMD)- and discrete wavelet transform (DWT)-based techniques are used to denoise and segment the pulmonic signal. Class imbalance is catered for with the adaptive synthetic (ADASYN) sampling technique.


Introduction
Pulmonary abnormalities encompass various lethal diseases. Chronic obstructive pulmonary disease (COPD) and pneumonia are treatable pulmonic illnesses given early diagnosis and proper prevention. Pneumonia is a pulmonary abnormality that can be caused by viral, bacterial, or fungal infections. COPD subjects are also vulnerable to a high risk of pneumonia, and those who develop pneumonia are more likely to die. According to the United Nations Children's Fund (UNICEF), half of the 5.9 million under-five deaths recorded in 2015 were caused by infectious illnesses and conditions, among which pneumonia ranked first [1]. It is a challenging task to distinguish pneumonia from an acute exacerbation of COPD, as both display the same symptoms [2]. Exacerbations as well as cough are commonly found in both COPD and pneumonia patients. The frequent global pervasiveness of COPD is also evidenced by WHO statistics: a reduction of only 0.3% in global mortality is estimated for 2016 when compared with the estimates of 2000, as shown in Figures 1 and 2 [3]. Most importantly, COPD patients are vulnerable to a high risk of pneumonia and other related diseases such as bronchitis. Characterized by frequent exacerbations due to viral, bacterial, or combined infections, COPD ranks third in global mortality. It damages the lungs and blocks the airways with mucus, preventing them from functioning properly. Underestimation of the death rate due to COPD is common; the false diagnostic rate may be greater than 70% [4]. Qualitative analysis is an alternative approach, but it is purely based on the expertise of medical experts, and a lack of experience can cause irreversible harm to a patient. In the current era, computer technology has made tremendous advancements in the early and rapid diagnosis of various adventitious sounds and pulmonary diseases from lung sound (LS) data banks [5]. Imaging modalities such as magnetic resonance imaging (MRI) and computerized tomography (CT) have provided the best diagnosis for pulmonary issues; conversely, the cost of the machines, exposure to harmful radiation, and the inconvenience of deployment in rural and far-flung areas are a few bottlenecks.

Signal analysis in conjunction with machine learning (ML) approaches applied to LS can provide better results for COPD and pneumonia diagnosis [6,7]. Feature engineering on existing ML prototypes can be a useful way to differentiate between the same adventitious sounds of different pulmonic diseases. These bottlenecks demand an efficient, economical, convenient, and noninvasive diagnostic methodology for the identification of pulmonic disease [8]. The need of the hour is to design a prototype or system capable of accurately detecting and classifying pulmonary diseases such as COPD and pneumonia from a simple and less intrusive modality.

Literature Review
Recently, machine learning (ML) schemes have been reported to identify a single lung disorder from lung sound (LS) analysis of adventitious sounds [9]. Analyzing LS via electronic auscultation is a better alternative approach to tracing pulmonary diseases compared with invasive and costly imaging diagnostic techniques [10][11][12][13]. Numerous studies have addressed the early diagnosis of pulmonary issues, but these abnormalities are complex and complicated. The high cost of building large-scale, well-labeled datasets is a major constraint on realizing this approach; on the contrary, limited training data raises model overfitting and low-reliability issues [14]. Novel predictive prototypes that incorporate both symptoms and physiological signals are being tested in telemonitoring interventions with positive outcomes.
However, their validation needs further investigation in the diagnostics of COPD and the identification of multiple lung diseases [15]. Timely and accurate diagnosis can lower the risk of death, but the subjective nature of adventitious sounds such as cough has led to high complexity in the detection of pneumonia, COPD, and other lung illnesses [16,17].
According to the review of related research carried out in the last decade, most researchers have contributed methods for diagnosing pulmonary issues via LS analysis, but considerable effort is still needed in this area. Targeting only single or multiple lung diseases, variation in the focus groups having a lung abnormality, and measuring the intensity of a particular lung disease only are a few constraints that require attention. To the best of our knowledge, COPD and pneumonia LS have hardly been addressed collectively in the context of LS analysis via digital signal processing (DSP) and ML techniques, so a literature review was performed targeting studies focused on LS analysis for the detection of pulmonary abnormalities, i.e., COPD and pneumonia in particular and other related lung illnesses in general.
A hybrid classifier was developed for a handheld device, combining a support vector machine (SVM), a random forest (RF), and a rule-based system to provide an advanced characterization scheme for COPD episodes in real time. In that study, extensive data sets were required to refine the rule-based system. The data utilized had missing values, which were handled through two sub-algorithms that interpolate the missing information; however, the missing levels of patient information in the dataset remain a constraint of the presented system. Moreover, a focus on adventitious LS analysis could enhance system reliability, but only the cough sound was monitored in that research [11]. Cough analysis has also provided an alternative way to diagnose childhood pneumonia rapidly. It was implemented with the logistic regression (LR) classification method to separate pneumonic from non-pneumonic subjects, i.e., those with asthma and bronchitis, with classification performed on statistical and wavelet features. The authors highlighted that cough alone is not enough to diagnose pneumonia efficiently; this gap could be the reason for the low specificity of the system and requires investigation [12,13]. A few authors have designed a pneumonia screening system from LS in which features are extracted from the wavelet transform (WT) and power spectral density (PSD). Thresholding of skewness and kurtosis, together with statistical analysis, is performed to recognize pneumonia subjects from cough analysis. The researchers were reluctant to claim maximum authenticity for the proposed system, as a larger data set could change the thresholding estimates that differentiate pneumonia and normal subjects [17][18][19]. In [20], a statistical analysis of COPD and pneumonia LS is performed to design a detection system for multiple lung diseases. The research highlighted significant differences in various features of the focused classes.
These features include the harmonic-to-noise ratio, pitch, and amplitude perturbation; however, the implication of these features for performing disease classification requires attention. In other research, the spectra of wheezes, crackles, and stridor were estimated to verify their variation in focus groups affected by nine different pulmonic diseases: pulmonary edema, asthma, viral bronchitis, acute asthma, tuberculosis, rheumatoid disease, pneumonia, epiglottitis, and laryngomalacia. In that work, verification was made on limited LS data for each disease, which is insufficient. Pulmonary issues involve high complexity and depend equally on age, gender, and other factors; the effect of diverse focus groups in each disease was overlooked, which could shift the spectral position of abnormal LS into diverse bands for the same illness [21]. A multichannel LS signal has been investigated for the classification of asthma, COPD, and normal classes. Statistical feature extraction from sub-bands of the PSD and artificial neural network (ANN) classifiers provided significant outcomes on self-collected LS data, but the proposed method demonstrated lower specificity and sensitivity compared with most of the latest research work and requires validation on real-time or authentic data [22]. The intensity of asthma has been measured via wheeze analysis: the wheezing sound possesses different power spectral distributions in different bands according to severity level. K-nearest neighbor (KNN) performed better than the ensemble (ENS) and SVM classifiers. That work was based on a simulation environment without any real-time realization; in special cases, wheeze may even be absent in asthma patients. The data collection was made entirely from asthmatic subjects; however, the same adventitious sound is associated with COPD subjects, which can affect the system's accuracy in practice [23,24].
In another research article, a COPD diagnosis technique was based on a transfer learning (TL) approach referred to as balanced probability distribution (BPD). The novel design accomplished better predictive capacity in general, even for a small COPD sample size and related common diseases: asthma, bronchitis, pneumonia, chronic bronchitis, and emphysema. That study mainly focused on applying the knowledge-graph method to some COPD datasets and twenty-five electronic medical records. The volume of data required by the system limits its scope to adults only: persons with communication disabilities and pediatric patients are unable to report the dry throat, fatigue, itching, tongue coating, aster, and confidence features required by the article for proper diagnosis, and if any symptom or feature is misreported due to human error, the proposed system cannot diagnose properly. This highlights the need for an automatic, economical, and non-invasive system for the diagnosis of COPD [25][26][27]. Another study addressed the identification of pneumonia and asthma subjects from cough sound analysis, specifically in the pediatric population. The proposed study implemented Mel frequency cepstral coefficient (MFCC), non-Gaussianity score, and Shannon entropy features to design the diagnostic system with the ANN technique. The system is entirely based on cough analysis to detect crackle sounds; clinically, this is not specific to pneumonia only, and other symptoms such as fever could enhance the specificity of the pneumonia diagnosis. Moreover, 44.4% of the asthmatic subjects also possessed the same issue, which highlights the need for a large database for reliable diagnosis [7]. Recently, a research article was published in which a convolutional neural network (CNN) was implemented on the LS database of the International Conference on Biomedical and Health Informatics (ICBHI).
However, that work focused only on classifying the adventitious sounds found in various pulmonic illnesses [28]. In other research, a novel approach called the variational convolutional autoencoder was presented for unbalanced data and implemented on the same database [29].
Tremendous efforts have been made to identify COPD and pneumonia in particular, and some other lung illnesses in general, from common adventitious sounds. Digital respiratory sounds provide valuable information for telemedicine and smart diagnostics as a non-invasive pathological detection route through the application of signal processing. Therefore, a comprehensive investigation is needed to devise a technique for the identification of COPD and pneumonia from common adventitious lung sounds. The important aspects in developing an efficient, robust, and reliable system for the diagnosis of pulmonary pathologies, particularly COPD and pneumonia, are worth mentioning: (i) data mining to extract the relevant and significant LS data, which should help develop diagnosis methods for COPD, pneumonia, and healthy LS; (ii) the design of the diagnosis method with simple statistical features that do not burden the system with computational cost and give its performance significant robustness; (iii) investigation of the minimum significant features required to perform the classification of COPD, pneumonia, and healthy LS; (iv) performance analysis of various classification methodologies on selected features that are computationally smart.

Materials and Methods
The time and spectral characteristics of lung sounds can be used to differentiate between them. There is motivation to use cepstral-based features for the identification of adventitious LS along with time, spectral, and other features: as sound is a common factor in speech and LS, these features are expected to perform well, as they do in speech signal classification. The proposed diagnostic technique consists of four stages: (i) data acquisition from the database, (ii) preprocessing, (iii) feature extraction, and (iv) classification. The proposed method for the diagnosis of COPD, pneumonia, and healthy LS is presented in Figure 3. ROI extraction is carried out through the EMD technique to keep the domain the same and avoid information loss in LS analysis. The intrinsic mode functions encompassing the low frequencies are selected to reconstruct the LS signals of COPD, pneumonia, and healthy subjects. After denoising, features from the cepstral, time, and spectral domains are fused to investigate the performance of the proposed method with different machine learning algorithms for the classification of COPD, pneumonia, and healthy subjects.

Database
LS data of healthy and unhealthy subjects from the public respiratory sound database of the International Conference on Biomedical and Health Informatics (ICBHI) are used [30]. The database was compiled for an international competition, the first challenge of the IFMBE's International Conference on Biomedical and Health Informatics. It comprises 920 LS recordings from 126 healthy and unhealthy subjects. The unhealthy subjects included patients with COPD, upper respiratory tract infections (URTI), bronchiectasis, asthma, and pneumonia, as shown in Figure 4. In this research, 703 LS data sets of COPD, pneumonia, and healthy subjects are used. The selected data have a standard sampling frequency of 44.1 kHz. Table 1 presents information about the recording equipment used to develop the LS database, and Table 2 lists the demographic information of the focused section of the ICBHI database, which consists of COPD, pneumonia, and healthy LS.


Pre-Processing and Segmentation
Preprocessing is the first and a critical step in signal processing. It involves the removal of baseline wander and high-frequency noise, along with other artifacts that can corrupt the acquired signal. Heart sounds and muscle and skin artifacts are common interferences. The database used in this research also lacks information about confounding noise sources. Therefore, empirical mode decomposition (EMD) [31] and discrete wavelet transform (DWT) [32] techniques are used to segment the signal and remove its noisy portion.
The LS signal (raw) is demonstrated in Figures 5 and 6 in the time and frequency domain, respectively.
(i) Empirical Mode Decomposition

Empirical mode decomposition (EMD) has been used in many studies to decompose a signal and extract the ROI. In the EMD technique, the signal is decomposed into components of an intrinsic nature known as intrinsic mode functions (IMFs). An IMF has only one extremum between zero crossings and a zero mean value. Formally, an IMF is a function that satisfies the following conditions: (a) in the whole data set, the number of extrema and the number of zero-crossings must either be equal or differ at most by one; (b) at any point, the mean value of the envelope defined by the local maxima and the envelope defined by the local minima is zero.
If L[n] is the acquired LS signal, let m_1[n] be the mean of its upper and lower envelopes, estimated from a cubic-spline interpolation of the local maxima and minima. The first IMF can then be calculated by Equation (1):

IMF_1[n] = L[n] − m_1[n]. (1)

Repeating the same step on the residue r_1[n] = L[n] − IMF_1[n], with envelope mean m_2[n], IMF_2 can be calculated by Equation (2):

IMF_2[n] = r_1[n] − m_2[n]. (2)

Generalizing the procedure, the Kth IMF may be calculated from Equation (3):

IMF_K[n] = r_{K−1}[n] − m_K[n], where r_{K−1}[n] = L[n] − Σ_{i=1}^{K−1} IMF_i[n]. (3)

In total, ten IMFs are extracted and observed experimentally. Figures 7-9 show the results for the COPD, pneumonia, and healthy classes after implementation of the EMD technique. The EMD technique mines the signals associated with different intrinsic time scales into a collection of IMFs; hence, any event can be localized in the time as well as the frequency domain. It is important to note that the frequency of normal tracheal lung sound lies in the 60-600 Hz range [33]. IMF-2, IMF-3, and IMF-4 were selected after experimentation; therefore, the LS information in IMF-2, IMF-3, and IMF-4 is the ROI in this research work. Higher IMFs were discarded due to the presence of noise and other lower-frequency components. The LS signal is reconstructed from the selected IMFs. The time and frequency domain representations of the LS signal reconstructed from IMF-2, IMF-3, and IMF-4 of the raw LS signal are demonstrated in Figures 10 and 11.

(ii) Discrete Wavelet Transform

The discrete wavelet transform (DWT) is a powerful tool to reduce noise by decomposing a signal. The signal L[n] obtained after the implementation of EMD is further denoised by the application of the DWT via Equation (4):

W_L(a, b) = (1/√a) Σ_n L[n] Φ((n − b)/a), (4)

where a, b, Φ, and W_L represent the scaling factor, translation factor, mother wavelet, and wavelet transformation function of the input time series L[n], respectively.
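The sifting procedure of Equations (1)-(3) can be sketched as follows. This is a minimal NumPy/SciPy illustration, not the implementation used in the study; a fixed number of sifting iterations stands in for the usual stopping criterion, and the function names are chosen here for illustration.

```python
import numpy as np
from scipy.interpolate import CubicSpline

def envelope_mean(x):
    """Mean of the upper/lower cubic-spline envelopes; None if too few extrema."""
    n = np.arange(len(x))
    maxima = np.where((x[1:-1] > x[:-2]) & (x[1:-1] > x[2:]))[0] + 1
    minima = np.where((x[1:-1] < x[:-2]) & (x[1:-1] < x[2:]))[0] + 1
    if len(maxima) < 4 or len(minima) < 4:
        return None
    upper = CubicSpline(maxima, x[maxima])(n)
    lower = CubicSpline(minima, x[minima])(n)
    return (upper + lower) / 2.0

def emd(signal, max_imfs=10, sift_iters=8):
    """Decompose `signal` into IMFs by repeated sifting (Equations (1)-(3))."""
    residue = np.asarray(signal, dtype=float).copy()
    imfs = []
    for _ in range(max_imfs):
        if envelope_mean(residue) is None:   # residue is near-monotonic: stop
            break
        h = residue.copy()
        for _ in range(sift_iters):          # sift: subtract envelope mean m_k[n]
            m = envelope_mean(h)
            if m is None:
                break
            h = h - m
        imfs.append(h)
        residue = residue - h                # r_k[n] = r_{k-1}[n] - IMF_k[n]
    return imfs, residue

def reconstruct_roi(imfs):
    """Sum IMF-2 to IMF-4, mirroring the IMF selection described above
    (assumes at least four IMFs were extracted)."""
    return np.sum(imfs[1:4], axis=0)
```

By construction, the IMFs plus the final residue reconstruct the input exactly, which is what makes selecting IMF-2 to IMF-4 a lossless band selection within the decomposition.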
High-frequency noise can be introduced just as traces of heart sound (HS) are introduced during LS data acquisition. Therefore, the DWT is used to ensure the removal of any irrelevant signal artifacts from the ROI. The experimental signals are sampled at 44.1 kHz, so by the Nyquist sampling theorem the signal content extends to 22.05 kHz. The LS is decomposed by the mother wavelet into level-1 approximation (0 Hz to 11,025 Hz) and detail (11,025 Hz to 22,050 Hz) coefficients. The Coiflet-5 wavelet [19] is used as the mother wavelet to perform denoising, due to its morphological resemblance (i.e., shape) to the LS signal; it is shown in Figure 12. The approximation coefficients contain the low frequencies, whereas the detail coefficients contain the high frequencies. Hard thresholding is applied to the detail coefficients. In the last step, the L[n] signal is reconstructed from the approximation and thresholded detail coefficients. The processed signal is shown in Figures 13 and 14 in the time and frequency domains, respectively. These figures demonstrate LS signals with minimal high-frequency noise and other irrelevant signal artifacts before the feature extraction stage.
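The one-level decomposition, hard thresholding of the detail coefficients, and reconstruction can be illustrated as below. For brevity this sketch uses the Haar wavelet in plain NumPy rather than the Coiflet-5 wavelet used in the paper (which in practice would come from a wavelet library such as PyWavelets); the function name is chosen here.

```python
import numpy as np

def haar_dwt_denoise(x, threshold):
    """One-level Haar DWT: split into approximation (low-frequency) and
    detail (high-frequency) coefficients, hard-threshold the details,
    then reconstruct the signal."""
    x = np.asarray(x, dtype=float)
    if len(x) % 2:                      # pad to an even length
        x = np.append(x, x[-1])
    even, odd = x[0::2], x[1::2]
    approx = (even + odd) / np.sqrt(2)  # lower half of the band
    detail = (even - odd) / np.sqrt(2)  # upper half of the band
    # hard thresholding: zero out small detail coefficients, keep the rest
    detail = np.where(np.abs(detail) > threshold, detail, 0.0)
    # inverse transform from approximation and thresholded details
    y = np.empty_like(x)
    y[0::2] = (approx + detail) / np.sqrt(2)
    y[1::2] = (approx - detail) / np.sqrt(2)
    return y
```

With a zero threshold the reconstruction is the identity, which confirms that only the thresholded high-frequency content is removed.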

Feature Extraction
Features are the major characteristics upon which a classifier distinguishes between the different LS classes. LS is non-stationary by nature; for this reason, a single feature cannot characterize it. A total of 116 features were extracted, comprising 19 time-domain, 12 frequency-domain, 26 cepstral-domain, and 59 texture-based features. The texture features [34] are extracted from the spectrogram of the signals (hop length: 10 samples; window length: 20 samples; window type: Hann; overlap: 10 samples). A summary of the extracted features is presented in Table 3.
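For illustration, a spectrogram with the stated framing (20-sample Hann window, 10-sample overlap, hence a 10-sample hop) can be produced as below. This is a sketch using SciPy; the texture-feature computation of [34] on this spectrogram is not reproduced, and the toy sine input stands in for an LS segment.

```python
import numpy as np
from scipy.signal import spectrogram

fs = 44100                      # sampling rate of the ICBHI recordings
t = np.arange(0, 0.1, 1 / fs)   # 0.1 s toy signal standing in for an LS segment
x = np.sin(2 * np.pi * 440 * t)

# 20-sample Hann window with 10-sample overlap => 10-sample hop length
freqs, times, S = spectrogram(x, fs=fs, window='hann',
                              nperseg=20, noverlap=10)
# S has one row per frequency bin (nperseg // 2 + 1 = 11)
# and one column per analysis frame
```

Texture features would then be computed over S treated as an image.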

The issue of unbalanced data is very common in the field of e-health. It refers to a large disparity in the number of data elements between the various classes. Several methods can be used to replicate the data of the minority classes; the adaptive synthetic sampling method (ADASYN) is one of these data augmentation techniques and is used here [35]. It helps to balance the data samples of the normal and pneumonia subjects against the COPD subjects using an appropriate number of synthetic samples.
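The core ADASYN rule, generating more synthetic minority samples where minority points are surrounded by majority neighbours, can be sketched in NumPy as below. This is a simplified illustration, not the reference implementation used in the study (which libraries such as imbalanced-learn provide); `adasyn_sketch`, `k`, and `seed` are names and parameters chosen here.

```python
import numpy as np

def adasyn_sketch(X_min, X_maj, k=5, seed=0):
    """ADASYN-style oversampling: minority samples with more majority
    neighbours receive more synthetic copies, interpolated towards
    their minority-class neighbours."""
    rng = np.random.default_rng(seed)
    X_all = np.vstack([X_min, X_maj])
    n_min = len(X_min)
    G = len(X_maj) - n_min                      # synthetic samples needed
    # density ratio: fraction of majority points among the k nearest neighbours
    r = np.empty(n_min)
    for i, x in enumerate(X_min):
        d = np.linalg.norm(X_all - x, axis=1)
        nn = np.argsort(d)[1:k + 1]             # skip the point itself
        r[i] = np.sum(nn >= n_min) / k
    w = r / r.sum() if r.sum() > 0 else np.full(n_min, 1.0 / n_min)
    g = np.round(w * G).astype(int)             # per-sample synthetic counts
    synth = []
    for i, x in enumerate(X_min):
        d = np.linalg.norm(X_min - x, axis=1)
        nn = np.argsort(d)[1:k + 1]             # minority-class neighbours
        for _ in range(g[i]):
            j = rng.choice(nn)                  # interpolate towards a neighbour
            lam = rng.random()
            synth.append(x + lam * (X_min[j] - x))
    return np.array(synth)
```

Appending the returned samples to the minority class roughly equalizes the class sizes.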

Classification
Classification is performed by the classifier based on the distinct features extracted and selected in the previous stage. In this research, various classifiers were tested to observe the performance of the proposed method. QD classifiers assume that each class has its own covariance matrix; specifically, the predictor variables are not assumed to have common variance across each of the K levels of Y. Mathematically, an observation from the kth class is supposed to be of the form X ∼ N(µ_k, Σ_k), where Σ_k is the covariance matrix for the kth class. Under this supposition, the classifier assigns an observation to the class for which the discriminant score of Equation (5) is the largest:

δ_k(l) = −(1/2) ln|Σ_k| − (1/2)(l − µ_k)^T Σ_k^{−1} (l − µ_k) + ln π_k, (5)

where δ_k(l) is the estimated discriminant score that the observation will fall in the kth class of the response variable, based on the value of the predictor variable l.
µ_k: the mean of all the training observations from the kth class. Σ_k: the covariance matrix estimated from the training observations of the kth class. π_k: the prior probability that an observation belongs to the kth class (not to be confused with the mathematical constant π ≈ 3.14159).
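Under this Gaussian assumption, the quadratic discriminant rule can be implemented directly. The following is a NumPy sketch for illustration only (the study's results were generated in MATLAB); `qd_train` and `qd_predict` are names chosen here.

```python
import numpy as np

def qd_train(X, y):
    """Estimate per-class mean, covariance, and prior for QD classification."""
    params = {}
    n = len(y)
    for k in np.unique(y):
        Xk = X[y == k]
        mu = Xk.mean(axis=0)
        sigma = np.cov(Xk, rowvar=False)     # per-class covariance Sigma_k
        params[k] = (mu, sigma, len(Xk) / n) # prior pi_k from class frequency
    return params

def qd_predict(X, params):
    """Assign each row of X to the class with the largest delta_k(l)."""
    scores = []
    for mu, sigma, pi in params.values():
        inv = np.linalg.inv(sigma)
        _, logdet = np.linalg.slogdet(sigma)
        d = X - mu
        # delta_k(l) = -0.5 ln|Sigma_k| - 0.5 (l-mu_k)^T Sigma_k^-1 (l-mu_k) + ln pi_k
        quad = np.einsum('ij,jk,ik->i', d, inv, d)
        scores.append(-0.5 * logdet - 0.5 * quad + np.log(pi))
    keys = list(params.keys())
    best = np.argmax(np.vstack(scores), axis=0)
    return np.array([keys[i] for i in best])
```

On well-separated Gaussian classes this rule recovers the labels almost perfectly, since the model assumptions hold exactly.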
Cross-validation is a resampling technique used to assess machine learning systems on a limited data set; therefore, cross-validation of the system is performed with different numbers of folds. Accuracy (ACC), true positive rate (TPR), false negative rate (FNR), positive predictive value (PPV), and false discovery rate (FDR) are the main parameters by which system performance is measured, using Equations (6)-(10):

ACC = (TP + TN)/(TP + TN + FP + FN), (6)
TPR = TP/(TP + FN), (7)
FNR = FN/(TP + FN), (8)
PPV = TP/(TP + FP), (9)
FDR = FP/(TP + FP), (10)

where TP, TN, FP, and FN denote the counts of true positives, true negatives, false positives, and false negatives, respectively. In this research, cross-validation is performed with 5, 10, and 20 folds to demonstrate system performance. Moreover, 20% and 25% holdout validation are also carried out to verify the outcome. All results are generated using a Core i7 CPU, 8 GB of RAM, and a 1 TB HDD with MATLAB R2018a.
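The fold generation and the scoring measures of Equations (6)-(10) can be sketched as follows (a plain NumPy illustration; the function names are chosen here, not taken from the study's MATLAB tooling):

```python
import numpy as np

def kfold_indices(n, k, seed=0):
    """Yield (train, test) index arrays for k-fold cross-validation."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n), k)
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train, test

def performance(tp, tn, fp, fn):
    """Scoring measures of Equations (6)-(10) from confusion-matrix counts."""
    acc = (tp + tn) / (tp + tn + fp + fn)  # accuracy
    tpr = tp / (tp + fn)                   # true positive rate (sensitivity)
    fnr = fn / (tp + fn)                   # false negative rate = 1 - TPR
    ppv = tp / (tp + fp)                   # positive predictive value
    fdr = fp / (tp + fp)                   # false discovery rate = 1 - PPV
    return acc, tpr, fnr, ppv, fdr
```

Each fold serves once as the test set while the remaining folds train the classifier, and the per-fold counts feed the five measures above.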

Results and Discussion
After the experimentation with different classifiers in this study, the quadratic discriminant (QD) classifier outperformed with the minimum number of features when compared with the performance of other classification techniques. Table 4 lists the different classifiers which have been tested on the proposed method with different combinations of features. Graphically, Figure 15 highlighted that the proposed technique has outperformed on SVM_FG classification method. It has achieved an accuracy of 99.70%, 99.40%, and 99.20% on different combinations of 85, 97, and 116 features, respectively. On the other hand, the QD classifier has provided a maximum accuracy of 98.60% with 26 features only. Investigation and selection of significant features to accomplish maximum accuracy with minimum features are important. It plays a key role to lower the burden of the computational cost of those features which are not paying a significant role at the classification stage. As the proposed system is simulated to develop a stand-alone embedded system to identify the pulmonary abnormalities in the future so it will help to enhance the robustness of the system. The feature significance is characterized by analyzing the scatter plot and statistics of different features presented in Appendix A. Figures 16 and 17 demonstrate the scatter plot between log energy, and GFCC-5, MFCC-10, and spectral decrease of all classes. In Figure 16, it can be visualized that there exists a minimum correlation between LE and GFCC-5 features of focused LS data. The minimum correlation also reflects the large spread among feature estimates, so a classifier can perform well with features that have less correlation and more spread between them. The same can be observed in Figure 17 regarding MFCC-10 and SDec estimations of focused data. Therefore, it is expected that such features will play a vital role in the classification when fed to the classifier. 
The significance of the other features is analyzed experimentally in the same fashion. Table 5 lists the significant features selected for the classification of pulmonary pathologies after observing their scatter plots. Experimentation with different classifiers proved the optimum performance of the QD technique: the system achieved 99.7% accuracy with the QD classifier using only 25 features. A feature reduction of 78.44% was accomplished with the backward elimination method [36], reducing time complexity and unnecessary computation. The mathematical description of the substantial features is provided in Table 6. These features provide an essential basis for the classification of COPD, pneumonia, and healthy LS. Standard deviation (SD) in the time domain describes the spread of the signal around its mean; in the frequency domain, spectral standard deviation (SSD) describes the spread of the signal frequencies around the spectral mean. Log energy (LE) compares the variation in signal strength of the COPD and pneumonia LS classes with the healthy LS class. The peak-to-peak (PP) value returns the difference between the highest and lowest values of the LS signal [24]. Spectral skewness (SSkw) estimates the symmetry of the spectral magnitude distribution about its arithmetic mean; it specifies the extent of dissimilarity among spectral magnitudes, taking a low value for a flat spectrum and a high value for a vibrational spectrum. Spectral kurtosis (SK) estimates the resemblance of the spectral magnitude distribution to a Gaussian distribution; its main application is detecting peaks in the spectrum of an LS segment, with high values indicating high spectral peaks. Spectral roll-off (SRO) quantifies the concentration of the spectrum. Spectral decrease (SDec) calculates how steeply the spectral envelope decreases over frequency. Spectral flux (SF) estimates the variation in the shape of the spectrum by calculating the average difference among consecutive STFT frames [16]. MFCC and GFCC belong to the cepstral feature class; we adopted them for LS pathologies motivated by their performance in speech recognition and their robustness to noise [37].
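A few of the spectral descriptors above can be sketched directly from an STFT magnitude spectrum. These are textbook formulations of spectral roll-off, spectral decrease, and spectral flux; the exact definitions used in the study (normalizations, frame lengths) may differ.

```python
import numpy as np

def spectral_rolloff(mag, freqs, pct=0.85):
    """Frequency below which `pct` of the total spectral energy lies."""
    mag = np.asarray(mag, dtype=float)
    freqs = np.asarray(freqs, dtype=float)
    energy = np.cumsum(mag ** 2)
    idx = np.searchsorted(energy, pct * energy[-1])
    return freqs[min(idx, len(freqs) - 1)]

def spectral_decrease(mag):
    """Average steepness of the spectral envelope's decay over frequency,
    measured relative to the first spectral bin."""
    mag = np.asarray(mag, dtype=float)
    k = np.arange(1, len(mag))
    return np.sum((mag[1:] - mag[0]) / k) / np.sum(mag[1:])

def spectral_flux(frames):
    """Mean squared difference between consecutive STFT magnitude frames
    (rows = time frames, columns = frequency bins)."""
    frames = np.asarray(frames, dtype=float)
    d = np.diff(frames, axis=0)
    return np.mean(d ** 2)
```

For a single LS segment, `mag` would typically be `np.abs(np.fft.rfft(segment))`; a perfectly flat spectrum yields zero spectral decrease, and identical consecutive frames yield zero flux.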

Among the features in Table 6, the Mel frequency cepstral coefficients (MFCC) are computed in four steps: (i) frame blocking or windowing of the signal into 50 to 60 ms frames; (ii) performing a discrete Fourier transform; (iii) computing the logarithm of the signal; and (iv) warping the frequencies onto a Mel scale, followed by applying the discrete cosine transform (DCT). The Mel scale is calculated as follows:

Mel(f) = 2595 log10(1 + f/700)

Graphically, the system accuracy of the different classifiers with the selected feature sets is demonstrated in Figure 18. It shows that QD achieved 99.70% accuracy on the selected features, higher than any other classifier, when denoising is performed by WT.
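The standard Mel-scale mapping used in step (iv) of MFCC extraction, together with its inverse (needed when placing the triangular Mel filters), can be sketched as:

```python
import math

def hz_to_mel(f_hz):
    """Standard Mel-scale mapping used in MFCC extraction."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m):
    """Inverse mapping, used when converting Mel filter centers back to Hz."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
```

The mapping is roughly linear below 1 kHz and logarithmic above, which matches human pitch perception; the same warping underlies the GFCC filterbank, albeit with gammatone rather than triangular filters.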
Figure 19 shows the confusion matrix of the proposed diagnosis methodology for the identification of COPD, pneumonia, and healthy LS without denoising. It demonstrates the classification performance of the proposed system for all classes. Without denoising, the true positive rates for COPD, pneumonia, and healthy LS diagnosis are 98%, 93%, and 82%, respectively. In the pneumonia class, 11% and 8% of samples are falsely predicted as COPD and healthy, respectively, which overall depicts an 18% false-negative rate. In the normal or healthy class, 0% of samples are wrongly predicted as COPD, whereas 7% of healthy samples are classified as pneumonia, equivalent to a 7% false-negative rate. In the COPD class, <1% of samples are wrongly identified as normal while 2% of COPD samples are falsely identified as pneumonia. Experimentation with the DWT denoising technique after selecting the ROI via the EMD technique improved the true positive rate of all classes to greater than 99%, as demonstrated in Figure 20. Of the total 631 COPD samples, only three are falsely identified.
Similarly, two normal and one pneumonia samples are wrongly identified by the system out of 670 and 636 samples, respectively. Cross-validation on 5, 10, 15, and 20 folds is performed to authenticate the system performance, as shown in Table 7. Varying the number of folds did not affect the system's robustness; almost the same accuracy is achieved in each case. The hold-out technique is also used to validate system performance: the system achieved 99.70% and 99.80% accuracy with the 20% and 25% hold-out validation methods, respectively. In the 20% hold-out method, the data is divided into two portions, with 80% used for training and 20% for testing; this gives a better indication of how well the proposed method performs on unseen data. Different studies have been performed on the classification of pulmonary diseases. The proposed technique outperforms various published techniques for identifying pulmonary abnormalities from LS analysis in terms of the number of features and the accuracy, sensitivity, and specificity of the method.
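The fold-based cross-validation and hold-out splits described above can be sketched as index partitions; this is a minimal illustration of the splitting logic, not the study's actual evaluation code (which also applies ADASYN balancing on the training portion).

```python
import numpy as np

def kfold_indices(n, k, seed=0):
    """Shuffle n sample indices and split them into k disjoint folds;
    each fold serves once as the test set, the rest as training data."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    return np.array_split(idx, k)

def holdout_indices(n, test_frac=0.20, seed=0):
    """Single shuffled train/test split, e.g. an 80%/20% hold-out."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    n_test = int(round(test_frac * n))
    return idx[n_test:], idx[:n_test]   # (train indices, test indices)

folds = kfold_indices(100, 5)           # 5-fold cross-validation partition
train, test = holdout_indices(100, 0.20)
```

In k-fold cross-validation the model is trained k times, each time testing on a different fold, and the k accuracies are averaged; the hold-out split instead trains once and tests on the withheld portion.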
To the best of our knowledge, there is a single previous research work on COPD and pneumonia LS, but the main aim of that study was to analyze acoustic parameters from respiratory signals in COPD and pneumonia patients; those parameters were not further utilized to classify the COPD and pneumonia LS [20]. Some existing research works extract fewer features than the proposed technique, but they can identify only a single pulmonary illness from LS analysis [6,12,18,25,26]; details are presented in Table 8. A fair comparative study requires similar methods, yet even in the most recent literature, work focused on COPD, pneumonia, and healthy LS identification is limited, which makes comparison of this research with similar methods difficult. Nevertheless, we believe this research represents an innovative step toward screening of COPD, pneumonia, and healthy LS. The proposed technique provides an efficient approach with outstanding classification accuracy, outperforming existing techniques on other multiple pulmonic pathologies from LS analysis owing to its simple statistical features, low computation, and accuracy [6,7,12,20,22,24,25,28,29]. The performance analysis of the proposed diagnostic technique against existing methods for pulmonic pathologies is shown in Table 9. As mentioned, these techniques focused on different combinations of lung diseases, and there is a lack of research in which LS classification is performed for COPD and pneumonia identification only. Therefore, the presented research work is a progressive and fruitful initiative to explore the use of LS for the identification of normal, COPD, and pneumonia subjects.
Table 9. Performance analysis of the proposed diagnosis methodology for COPD and pneumonia identification with current techniques on lung pathologies.

Conclusions
In this article, an efficient method is proposed for the classification of pulmonary pathologies from LS analysis. It provides a noninvasive, convenient, and low-cost solution for the diagnosis of COPD and pneumonia, which are among the leading causes of death worldwide. The research work is based on the ICBHI open-access LS database, which endorses the authenticity of the data used in developing the proposed method. Extracting the ROI through EMD and further denoising via DWT proved the best approach to unfold the main information required for feature extraction, as verified by the statistical analysis of the extracted features. Feature reduction through the backward elimination method and experimentation on different feature sets with various classification techniques established the best performance of the quadratic discriminant classifier with only 25 features. The proposed study performed classification with 99.7% accuracy, TPR > 99%, and FNR < 1%. The cross-validation results on different folds as well as the hold-out validation method substantiate the reported accuracy. The use of simple statistical features also ensures the minimal computational cost of the presented research work. The proposed method is purely based on automated LS analysis; no clinical information is needed, unlike other techniques based on knowledge graphs. Although various studies have addressed different lung diseases, to the best of our knowledge there is a lack of research targeting specifically the diagnosis of COPD and pneumonia via a signal processing and machine learning approach. Therefore, the proposed research work is a progressive and innovative approach to the diagnosis of COPD and pneumonia. It will help to monitor pulmonary health in COPD and pneumonia patients and, owing to its simple technique, is ready to be embedded. Furthermore, it is a promising method to assist pulmonologists as a counterpart to their clinical diagnosis.
Implementing the proposed technique on dedicated hardware and designing a stand-alone portable system for pulmonary care would take this research one step further. The performance of the presented work also needs to be validated on self-collected data.