Diagnosing Various Severity Levels of Congestive Heart Failure Based on Long-Term HRV Signal

: Previous studies have attempted to ﬁnd autonomic di ﬀ erences of the cardiac system between the congestive heart failure (CHF) disease and healthy groups using a variety of algorithms of pattern recognition. By comparing previous literature, we have found that there are two shortcomings: (1) Previous studies have focused on improving the accuracy of models, but the number of features used has mostly exceeded 10, leading to poor generalization performance; (2) Previous works rarely distinguish the severity levels of CHF disease. In order to make up for these two shortcomings, we proposed two models: model A was used for distinguishing CHF patients from the normal people; model B was used for diagnosing the four severity levels of CHF disease. Based on long-term heart rate variability (HRV) (40000 intervals–8h) signals, we extracted linear and non-linear features from the inter-beat-interval (IBI) series. After that, the sequence forward selection algorithm (SFS) reduced the feature dimension. Finally, models with the best performance were selected through the leave-one-subject-out validation. For a total of 113 samples of the dataset, we applied the support vector machine classiﬁer and ﬁve HRV features for CHF discrimination and obtained an accuracy of 97.35%. For a total of 41 samples of the dataset, we applied k-nearest-neighbor ( K = 1) classiﬁer and four HRV features for diagnosing four severity levels of CHF disease and got an accuracy of 87.80%. The contribution in this work was to use the fewer features to optimize our models by the leave-one-subject-out validation. The relatively good generalization performance of our models indicated their value in clinical application.


Introduction
Heart failure is the disability of the heart in pumping blood to the body efficiently [1]. Many previous works have confirmed that autonomic imbalance is the leading cause of congestive heart failure (CHF) disease. The autonomic nervous system (ANS) consists mainly of two branches, the sympathetic nervous system (SNS) and the parasympathetic nervous system (PNS) [2,3]. For the subjects in the normal group, the activities of two nervous systems are in a state of dynamic equilibrium. However, studies have shown that CHF breaks this equilibrium state due to excessive sympathetic activity [4]. The New York Heart Association (NYHA) has classified CHF into four scales based on the severity of the disease, NYHA I, II, III, and IV, respectively. Some papers have classified patients with mild heart failure (NYHA I or II) as low-risk patients and patients with severe heart failure (NYHA III or IV) as high-risk patients [5]. For patients with mild heart failure, the excessive activation of SNS and the continuous reduction of PNS activity would lead to an extreme imbalance of autonomic nerves if there is no suitable treatment, which would inevitably lead to further deterioration of the disease [5].
Heart rate variability (HRV), which represents the variation of inter-beat-interval, provides very useful information on the dynamics of heartbeat behavior [6,7]. Previous researchers have typically extracted linear and nonlinear HRV features from the inter-beat-interval (IBI) signals. The linear features are the signal parameters in time-domain, frequency-domain, and time-frequency domain [7][8][9]. In order to reveal other valuable information contained in the IBI signal, some researchers have used the complexity analysis method to extract nonlinear indicators [10][11][12][13][14][15][16][17][18][19][20]. In 2003, Asyali et al. [21] applied Bayesian classifiers and nine long-term measurements for CHF discrimination with an accuracy of 93.24%. In 2007,İşler et al. [22] utilized eight features of wavelet entropy and classical short-term HRV measurements and k-nearest-neighbor classifiers for CHF diagnosis and achieved an accuracy of 96.39%. In 2012, Yu et al. [23] applied support vector machine (SVM) classifier and genetic algorithm (GA) for CHF recognition and achieved an accuracy of 98.79%. In their work, 16 features by using bi-spectral heart rate (HR) analysis of short-term measurements were applied for CHF recognition [23]. In 2016, Acharya et al. [24] applied empirical mode decomposition (EMD) for automated identification of congestive heart failure. Using 22 short-term HRV measurements with support vector machines, they achieved an accuracy of 97.64% [24]. In 2017, Mahajan et al. [25] applied Ensemble classifiers and 10 short-term HRV measurements for CHF discrimination with an accuracy of 98.10%.
In 2013, Melillo et al. [5] tried to assess the severity levels of CHF disease by using long-term HRV measurements. The classification and regression tree (CART) classifier were used to separate lower-risk patients from higher-risk patients with an accuracy of 85.4%. In 2015, Shahbazi et al. [26] proposed a generalized discriminant analysis method, applied k-nearest-neighbor classifier for CHF risk assessment based on long-term HRV, and obtained an accuracy of 100%. In 2016, Chen et al. [27] built a four-level risk assessment model for CHF detection and quantification based on the DT-SVM (decision tree based support vector machine) algorithm. The model obtained a total accuracy of 96.61% in risk assessment among individuals of N (no risk), P1 (mild risk), P2 (moderate risk), and P3 (severe risk) [27]. In that paper, they used 180 features, including 126 dynamic measurements and 54 static measurements [27]. In 2019, Li et al. [28] proposed a four-stage classification problem using an end to end deep model, which extracted 20 features by convolution at max-pooling layers, and an accuracy of 97.6% was achieved.
From the previous work above, we have found that there are two points which need to be further explored: 1) They have focused on improving the accuracy of models, but the number of features in their pattern recognition models mostly exceeds 10 [23][24][25]27,28]. However, the limited number of features improves certainly the interpretability of the classification; 2) Compared with the diagnosis of the severity levels of CHF, researchers were more inclined to diagnose CHF patients from normal people. For CHF risk assessment, their research was the consolidation of multiple NYHA scales into one category for identification [5,[26][27][28]. Therefore, researchers rarely distinguished the four severity scales of CHF disease clearly, which is a four-class classification problem. In view of the above-mentioned reasons, we proposed two models based on long-term HRV (40,000 HRV intervals-8h): model A was used for distinguishing CHF patients from the normal people (binary classification); model B was used for diagnosing the four severity levels of CHF disease (four-class classification). Detailed processes of building the two models are described in the following sections.

Data Collection
In previous studies, the RR interval time series (RRITS) was often used as the data source for analyzing HRV measures [21][22][23][24][25][26][27]. The RRITS was calculated as the interval between two consecutive R peaks of heartbeat in the electrocardiographic (ECG) signals. The data we studied were all from the RRITS database in the PhysioBank server. The data can be downloaded online from http://www.physionet.org/cgi-bin/atm/ATM for free. Each recording in the RRITS database was manually reviewed and corrected by experts. Supported by the National Research Resource Center of the National Institutes of Health, the PhysioBank server is a joint project involving many research and medical institutions [29,30].
As shown in Table 1, our study used a total of four RRITS databases. A total of 116 24-hour RR interval recordings were included in the dataset of 72 normal subjects and 44 CHF subjects. The data of normal subjects comes from two databases: the MIT/BIH Normal Sinus Rhythm Database (Nrsdb) and the Normal Sinus Rhythm RR Interval Database (Nrs2db) [29]. The data of CHF patients were obtained from the BIDMC Congestive Heart Failure Database (Chfdb) and the Congestive Heart Failure RR Interval Database (Chf2db) [30]. Due to the lack of recordings of NYHA IV CHF patients in the CHF RR database, we labeled NYHA III-IV CHF patients as NYHA IV CHF patients in the database.

Data Preprocessing
The flowchart of the data processing is given in Figure 1. We first downloaded the RRITS signal from the public database. However, we found that RRITS included too many ectopic heartbeat intervals that cover up important physiological information [5]. According to beat annotations files provided from the database, we removed the ectopic intervals of each RRITS signal with manual review and correction. Therefore, the NN interval time series will be obtained by removing all ectopic intervals, such as ventricular ectopic, supraventricular ectopic, or unknown. In order to ensure the reliability of the signal, the ratio of the length of the NN interval time series to the length of the RRITS (NN/RR) was calculated. The ratio was used as a measure of data reliability to exclude the recordings where the ratio was less than the threshold. In general, the ratio is set to 80% because it is a satisfactory compromise between the number of subjects included in the research work and the NN signal quality [5]. Based on the above excluding criterion, one recording was excluded from each of NYHA II, NYHA III, and NYHA IV CHF groups, respectively. Finally, we randomly selected continuous 40,000 NN intervals as one segment for each recording. As shown in Table 2 It was necessary to specify that in order to make the HRV signals more stable and homogeneous when extracting the frequency domain features, some previous researchers removed the linear trend of the HRV signal and used the cubic-spline interpolation method to re-sample it at a rate of 4 samples per second [22,23,31,32]. However, we applied the Lomb-Scargle (LS) periodogram proposed in [31,32] to calculate the power spectrum. In this case, HRV signals did not require resampling in our research [33,34].  It was necessary to specify that in order to make the HRV signals more stable and homogeneous when extracting the frequency domain features, some previous researchers removed the linear trend of the HRV signal and used the cubic-spline interpolation method to re-sample it at a rate of 4 samples per second [22,23,31,32]. However, we applied the Lomb-Scargle (LS) periodogram proposed in [31,32] to calculate the power spectrum. In this case, HRV signals did not require resampling in our research [33,34].

Methods
After preprocessing the HRV signals, we extracted linear and nonlinear features. Then, the sequence forward selection (SFS) algorithm was used to reduce the dimension of the feature space.

Methods
After preprocessing the HRV signals, we extracted linear and nonlinear features. Then, the sequence forward selection (SFS) algorithm was used to reduce the dimension of the feature space. Finally, a variety of classification algorithms were used for building model A and B, and the models with the best performance were selected.

Linear Feature
Features extracted in the time-domain and frequency-domain are listed in Table 3 [7,8]. We extracted 8 time-domain features in this study. Frequency-domain HRV measures were analyzed based on the estimation of the power spectrum (PSD) through the LS method. After estimating PSD, three main spectral components were distinguished from the spectrum: very low frequency (VLF), low frequency (LF), and high frequency (HF) components. These frequency bands were bounded with the limits 0-0.04, 0.04-0.15, and 0.15-0.40 Hz, respectively [7]. To a certain extent, frequency-domain HRV measures reflected the activity of the autonomic system. As described in Table 3, we extracted 8 frequency-domain HRV features.
In addition to the classical HRV measures, we also used time-frequency domain measures based on the wavelet theory. HRV signal fluctuates in an irregular and complex manner [35]. Quantification power of the features based on time-domain and frequency-domain analysis is limited when analyzing non-stationary HRV signals. Therefore, we characterized and quantified the non-stationary properties of physiological signals based on wavelet theory. We used the mother wavelet 'db4' to perform a 4-scale decomposition. From the low frequency to high frequency, we got 5 frequency bands, and the wavelet coefficients of 5 bands were CA4, CD4, CD3, CD2, and CD1, respectively. Then, the mean and the standard deviation of the coefficients of each band were calculated [9,36]. Finally, we extracted 10 features based on the time-frequency domain.

Non-Linear HRV Features
In this paper, non-linear indicators were obtained by local Hurst exponent analysis based on the wavelet theory, detrended fluctuation analysis (DFA), entropy analysis, and Poincare plot analysis.
Biomedical signals are generated by complex self-regulating systems that process inputs with a wide range of characteristics [10,11]. Previous studies have shown that HRV signals with healthy people are a multi-fractal signal. However, the HRV signal with CHF patients is approximately a single fractal signal [10,11]. The fractal properties are characterized by ∆h (∆h = h max − h min , h is the Hurst exponent) [11]. In 1999, Ivanov et al. [11] evaluated the local Hurst exponent h through the modulus of the maxima values of the wavelet transform at each point in the time series and scales a = 2 × 1.15 i , i = 10,... 41. They have found the Hurst exponent has a great ability to distinguish CHF patients from normal people [11]. We analyzed the fractal characteristics of the subjects based on long-term HRV signals. The scales in our research were consistent with other papers, i.e., a = 2 × 1.15 i , i = 10,...41.
DFA is often used to quantify HRV signals. This technique is a modification of root-mean-square analysis of random walks applied to non-stationary signals [12,13]. In this paper, we extracted two features using the DFA algorithm: α 1 was obtained by least square fitting of F(n) and small scale n (4 ≤ n ≤ 16); α 2 was obtained by least square fitting of F(n) and larger scale n than 16.
Sample entropy (SampEn) and fuzzy measure entropy (FuzzyMEn) are often used as measures for the analysis of the complexity of HRV signals [14][15][16][17]. Usually, SampEn and FuzzyMEn are influenced by the parameters of embedding dimension m and tolerance threshold r [16,17]. In this paper, the best combination of m and r for SampEn and FuzzyMEn were obtained by statistical significance analysis.
The Poincare plot can reflect the nonlinear characteristics of the nature of the HRV signal. It is a graph of each RR (or NN) interval against the next interval [18,19]. In this article, we extracted three features based on the Poincare diagram analysis, i.e., SD1 (the width of the Poincare plot), SD2 (the length of the Poincare plot), and SD1/SD2 [20].

Feature Selection
We extracted 34 features, including 26 linear features and 8 non-linear features in our research. For high-dimension feature sets, building models directly has a large computation cost [37]. On the contrary, the limited number of features improves certainly the interpretability of the classification [37]. Therefore, it is necessary to use the feature selection method to find the best feature subset. This optimal feature subset uses the least number of features to get the best performance of the classifier [37]. In this paper, we used the SFS algorithm for reducing the feature dimension. SFS algorithm is a bottom-up search procedure, which starts from an empty feature set and adds one feature at each iteration step by using some evaluation functions [38,39]. In this paper, we used the classification performance of the classifier as the evaluation function of the feature selection process.

Classification
In order to build models with the best performance, our work applied a variety of classification algorithms, including support vector machine (SVM), linear discriminant analysis (LDA), k-nearest-neighbor (KNN), decision tree (DT), and Bayesian (NB). Relevant principles of classification algorithms can be referred to the previous studies [21][22][23][24][25][26][27]. The kernel function for SVM classifier was a radial basis function. We used the five algorithms mentioned for building model A and B. All features were normalized between 0 and 1 using the min-max method proposed in [22,32] before classification.
The performance of different classifiers was evaluated by the indicators of precision (Prec), sensitivity (Sens), specificity (Spec). The calculation of these three evaluation indicators refers to Equations (1)-(3). Besides, AUC, i.e., the area under the ROC (receiver operating characteristic) curve, was also simply calculated by Equation (4) [5,40]. To compute these estimators, true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN) should be measured first. For the four-class classification, we divided the problem into four 2-class classifications. In each 2-class problem, we viewed one kind of NYHA scales as a positive case and the other three NYHA scale as negative cases.
After building the models, there are usually two ways to validate the performance of the models: Cross-subject validation [36,41]: If the dataset has N subjects, and one subject provides n samples, this validation uses all the samples (m × n samples for the test set) from m different subjects as the test set, and uses all the remaining samples ((N−m) × n samples for training set) to train the classifier. This way one can obtain M accuracies by training M models if the m subjects selected each time are different from each other. Then, the average of M accuracies as the final accuracy of models is calculated.
Intra-subject validation [36,41]: The common ten-fold cross-validation is this way. Because this method of validation requires random scrambling of samples, it will result in multiple samples for each subject being present in both the test set and the training set.
Obviously, compared with cross-subject validation, the intra-subject validation usually obtains a higher recognition rate [24,42]. However, the model using intra-subject validation does not predict well for unknown data. On the contrary, this shows that the models using cross-subject validation and obtaining accuracy have a great application value. Since each subject in our study provided a segment, the leave-one-subject-out was cross-subject validation for evaluating models [22,23,32,40].

1.
The nonlinear feature α 1 was extracted based on the DFA algorithm, and the small scale n ranges were from four to 16. Different ranges of the large scale n (n ≥ 16) show very different quantification power for HRV signals. In order to find an optimal scale range for large scale n, we analyzed the statistical difference of the feature α 2 between the normal and CHF groups. The span of the scale range is 48, and the length of the sliding step is four.
As shown in Table 4 and Figure 2, α 2 was significantly different between the normal and CHF groups at the n values of 16 ≤ n ≤ 64, 56 ≤ n ≤ 104, 60 ≤ n ≤ 108, with 16 ≤ n ≤ 64 corresponding to the smallest p-value (p = 4 × 10 −4 by Student's t-test, marked red in Figure 2). Therefore, we chose 16 ≤ n ≤ 64 for the calculation of feature α 2 .     Table 5 gives the statistical F-test results of the feature α 2 with various scales for CHF disease with four severity levels. The F-test is generally used for the significant test between multiple groups. According to the statistical results, the feature α 2 was not statistically significantly different in all scales for CHF diseases with different severity levels. The range of scale from 16 to 64 was relatively better than other ranges and was applied for calculating feature α 2 when building model B. 2. Different parameters (m and r) of sample entropy (SampEn) and fuzzy measure entropy (FuzzyMEn) also lead to different quantification power of HRV signals. In order to find the best combinations of parameters, the parameter m changed from one to three with a step of one, and r changed from 0.10 to 0.20 with a step of 0.05. For the normal and CHF groups, Table 6 gives the statistical Kolmogorov-Smirnov (KS) test results of SampEn and FuzzyMEn with a different combination of parameters. For CHF diseases with different severity levels, Table 7 gives the statistical F-test results of SampEn and FuzzyMEn with different combinations of parameters. As shown in Table 6, m = 1 and r = 0.15 obtained the SampEn values with the most significant statistical difference (p = 1 × 10 −3 by KS test) between the two groups; m = 1 and r = 0.20 got the FuzzyMEn values with the most significant statistical difference (p = 9 × 10 −7 by KS test) between the two groups. Therefore, m = 1 and r = 0.15 was set for SampEn, and m = 1 and r = 0.20 for FuzzyMEn in model A. Similarly, based on F-test results in Table 7, m = 1 and r = 0.10 was set for SampEn and FuzzyMEn in model B.  Table 8 shows the leave-one-subject-out validation performance of model A for different classification algorithms. The highest accuracy of 97.35% was achieved by using both SVM and KNN (K = 1) classifiers. However, the SVM used fewer features (number of features (NF) = 5) to achieve the same accuracy of KNN (K = 1). In order to better apply model A to the clinical diagnosis of CHF disease, we tried our best to make CHF patients not misdiagnosed as normal subjects. Therefore, with the same accuracies of SVM and KNN (K = 1), the sensitivity indicator was an important index. For the sensitivity and AUC indicators, the performance of the SVM was (Sensitivity = 97.56% and AUC = 0.963) better than the performance of KNN (K = 1, Sensitivity = 95.12%, and AUC = 0.959). Therefore, considering the better NF, sensitivity, accuracy, and AUC, our work used the SVM classifier to build model A.

Model B
After accurately diagnosing CHF disease, it was necessary to categorize the four severity levels of the CHF disease. Considering that NYHA I CHF patients only have four recordings, the value of K cannot be too large for KNN algorithm. In our research, the value of K equaled 1, 3, and 5, respectively. Table 9 briefly shows the leave-one-subject-out validation parameters of model B for each classification algorithm. We applied KNN (K = 1) classifier and four features for building model B with the highest accuracy of 87.80%. The selected best features were seen to be PNN20, PNN50, TOTPWR, and FuzzyMEn, respectively. In order to further analyze the performance of the model, we divided the four-class classification problem into four binary classification problems. In each binary classification problem, one severity level of CHF disease was considered to be the target class, and other severity levels of CHF disease were considered to be the negative class. Table 10 shows the leave-one-subject-out validation performance of the KNN (K=1) classifier for building model B. For mild CHF disease (CHF I-II), the values of precision and sensitivity are smaller than those of severe CHF disease (CHF III-IV). The reason is that the number of mild CHF subjects is too small.   Table 11 highlights the results in previous related studies that used HRV signals as the data source for analysis. The accuracy of models built by previous researchers has exceeded 95%, which indicated that there is a big difference in the cardiac system between the normal and CHF groups. However, the number of features they used mostly exceed 10. In contrast, we applied the SVM classifier and only five features for model A with an accuracy of 97.35%. We used a larger sample size to build our model A to make it more stable than previous studies. Previous work is more inclined to find the difference between the normal and CHF groups. In Table 11, researchers have evaluated the different severity levels of CHF disease, and a large number of features were used for binary and four-class classifications [27,28]. However, we applied the KNN (K = 1) classifier and four features for diagnosing the four severity levels of CHF disease with an accuracy of 87.80%. In 2016, Acharya et al. [24] divided the RRITS of each recording in the database into a segment with 2000 intervals and applied the SVM classifier and 22 features for CHF discrimination with an accuracy of 97.60%. In 2018, Li et al. [42] divided the RRITS of each recording in the database into a segment with 300 intervals and applied the CNN classifier and one feature for CHF discrimination with an accuracy of 81.85%. In these two papers, one subject provided more than one RRITS segments in the database, and the use of ten-fold cross-validation to evaluate the performance of classifier means that the data of the training set and the data of the test set may come from the same subject. That was to say, the evaluation of these two models was intra-subject and not cross-subject validation. Because of large similarities between multiple segments of the same subject, intra-subject validation of evaluating models tended to achieve an over-optimistic accuracy. However, the two models we built were evaluated using more robust cross-subject validation, and a higher accuracy was obtained.

Limitations of This Study
We used 113 subjects to build the two proposed models. As we can see from Tables 1 and 2, the number of NYHA I and NYHA II CHF subjects was significantly smaller than that of NYHA III and NYHA IV CHF subjects. For the lack of recordings of NYHA IV CHF patient in the CHF RR database, we labeled NYHA III-IV CHF patients as NYHA IV CHF patients in the database. Obviously, the small number and imbalance of the dataset will make it difficult to further improve the accuracies of the classifiers [44,45]. Besides, due to the relatively small amount of recordings, we only leave one subject for cross-validation. As long as our dataset is small and unbalanced, it is impossible to confirm the generalization of our results unless a larger public dataset is available [44,45]. Therefore, we still need to validate our models with larger dataset in the future.

Conclusions
This paper proposed two models for automatic CHF diagnosis. Model A was used for distinguishing CHF patients from normal people (binary classification), and model B was used to diagnosing four severity levels of CHF diseases (four-class classification). Model A applied the SVM classifier and five HRV indicators (TOTPWR, PLF/PHF, SD1, α 2 , and SampEn) with an accuracy of 97.35%. Model B applied the KNN (K = 1) classifier and four HRV indicators (PNN20, PNN50, TOTPWR, and FuzzyMEn) with an accuracy of 87.80%. The advantage of model A and B was the use of a small amount of features while ensuring that the model has high accuracy. This advantage indicated the application value of our models in clinical practice.