Sleep Quality Evaluation Based on Single-Lead Wearable Cardiac Cycle Acquisition Device

In clinical conditions, polysomnography (PSG) is regarded as the “golden standard” for detecting sleep disease and offering a reference of objective sleep quality. For healthy adults, scores from sleep questionnaires are more reliable than other methods in obtaining knowledge of subjective sleep quality. In practice, the need to simplify PSG to obtain subjective sleep quality by recording a few channels of physiological signals such as single-lead electrocardiogram (ECG) or photoplethysmography (PPG) signal is still very urgent. This study provided a two-step method to differentiate sleep quality into “good sleep” and “poor sleep” based on the single-lead wearable cardiac cycle data, with the comparison of the subjective sleep questionnaire score. First, heart rate variability (HRV) features and ECG-derived respiration features were extracted to construct a sleep staging model (wakefulness (W), rapid eye movement (REM), light sleep (N1&N2) and deep sleep (N3)) using the multi-classifier fusion method. Then, features extracted from the sleep staging results were used to construct a sleep quality evaluation model, i.e., classifying the sleep quality as good and poor. The accuracy of the sleep staging model, tested on the international public database, was 0.661 and 0.659 in Cardiology Challenge 2018 training database and Sleep Heart Health Study Visit 1 database, respectively. The accuracy of the sleep quality evaluation model was 0.786 for our recording subjects, with an average F1-score of 0.771. The proposed sleep staging model and sleep quality evaluation model only requires one channel of wearable cardiac cycle signal. It is very easy to transplant to portable devices, which facilitates daily sleep health monitoring.


Introduction
Sleep is a vital activity of humans. However, it is difficult to measure accurately. In clinical conditions, polysomnography (PSG) and sleep questionnaires are usually used as objective and subjective standards, respectively, for the assessment of sleep quality. PSG is useful in the diagnosis and treatment of sleep disorders [1]. With it, sleep stage transitions and automatic nervous system activities in sleep can be observed by multichannel physiological electric signals. The up-to-date interpretation of PSG by the American Academy of Sleep Medicine (AASM) divides all sleep time into five stages: stage of wakefulness, stages 1-3 of non-rapid eye movement (stage N1-N3 in NREM) and a stage of rapid eye movement (REM) [1,2]. This categorization helps the diagnoses of some sleep diseases; however, the PSG's defects of inconvenience in daily use and interference with sleep are obvious and difficult to resolve. The sleep questionnaire is a subjective and rough assessment of the subject's overall sleep quality. Both PSG and sleep questionnaires are essential tools to obtain an individual's sleep health information; however, the connections between them are not so significant or coincident and cannot be indicated by simple mathematical expressions. Researchers have found that in healthy adults, subjective total sleep time is Sensors 2023, 23, 328 2 of 20 overrated compared with objective total sleep time, as well as subjective sleep efficiency [3]. Sleep stage N2 is closely correlated with subjective sleep quality and neither sleep efficiency nor total sleep time can be a predictor of subjective sleep quality. However, in older people, objective sleep efficiency is the strongest correlate of subjective sleep quality, but is inconsistent with that of younger adults. The restoration from the sleep of younger adults relates to slow-wave sleep closely [3,4].
Standard PSG monitoring obtains multiple signals such as electroencephalogram (EEG), electrooculogram (EOG), electrocardiogram (ECG), mouth and nose breathing and chest breathing. Among them, EEG is the most highly correlated with sleep staging results. Because PSG monitoring is complicated work with many wires, usually at least 12 leads, it interferes with patients' normal sleep, so technical needs exist to simplify the PSG method without too much loss of accuracy in sleep staging. In the research field, by extracting one or several channels of PSG signals, sleep staging algorithms are commonly based on EEG [5], ECG [6,7] and cardiopulmonary coupling (CPC) [8]. Despite the fact that the algorithms based on EEG generally perform much better than those based on ECG and CPC [6,9,10], the discomfort and noisy signal of EEG are its bottleneck. We need reduced connecting lines and fewer foreign body sensations to make sure that sleep is not disturbed by devices and that the data more closely resemble natural sleep.
Many research studies have verified the possibility of sleep staging by ECG signals [6,8,11,12]. ECG signals can be much more easily obtained than EEG signals. We only need to collect one lead of wearable ECG signals and that is very convenient and offers small disturbance to subjects. The signals can be obtained by the portable devices with two or three electrodes pasted on the surface of the subjects' chest or abdomen, with no wires around the body. Heart rate variability (HRV) is the physiological phenomenon of variation in the time interval between heartbeats. It is measured by the variation in the beat-to-beat interval. The data source used to calculate HRV is usually ECG or photoplethysmography (PPG); the difference between them is negligible [13,14]. Many researchers have introduced large numbers of HRV features and complicated neural networks into their sleep staging models, which is very time consuming and short of practical value [6,[15][16][17][18], and this need to be altered and improved.
In practice, the demand exists to evaluate subjective sleep quality with a few physiological signals for ordinary persons and for special post staff. The special post staff are those whose working hours are lengthy, working content is difficult and the accommodation is rough, such as soldiers, firefighters, pilots and construction workers. They must be energized before performing a special task, which requires them to have a good sleep. Thus, it is better for their sleep quality to be supervised objectively by the leaders. However, it is not a wise option to have them fill out the sleep quality questionnaires because some unaware or intended false answers might be included. We should resort to some technical means to obtain sleep information objectively without significantly affecting the staff's normal sleep. Generally, for healthy people, how they really feel about their sleep is considered to be credible and acceptable. The score of their subjective sleep questionnaires can be thought to be the standard of sleep quality. The aim of this paper is to develop a practical method to differentiate subjective sleep quality into two groups-"good" and "poor" using single-lead wearable cardiac cycle signals. To achieve this, we collected ECG data and sleep questionnaires from 200 employees who work in special positions when they sleep at night. We should first build a sleep staging model as the bridge. We strive to screen out those whose sleep quality is poor and demand that the algorithm does not consume too many hardware resources and can be easily transplanted into portable devices.

International Public Dataset
Our sleep staging model is built mainly based on the international public database and the following two datasets were used: (1) the Sleep Heart Health Study Visit 1 database (SHHS1DB) [19,20], (2) [15,21]. The SHHS1DB includes the PSG data of 6441 individuals; the participants have no history of treatment of sleep apnea, no tracheostomy, and no current home oxygen therapy. The CincDB is from a challenge whose goal is to classify target arousal regions in the sleep time at night using PSG data. In the training set of CincDB, 994 subjects were included, some of whom were diagnosed with sleep disorders. All the data we used are annotated in 30 s contiguous intervals: wakefulness, stage N1, stage N2, stage N3 and REM. We pick out only one channel of ECG signals from PSG data. We selected all the subjects in CincDB and the same number of subjects in SHHS1DB randomly.

Subjects
Two hundred healthy male subjects participated in this experiment. Their age ranged from 18 to 35. When they slept habitually in their bunks at night, each person was asked to paste a tiny device manufactured in our laboratory on their chest to record ECG signals. The size of the device is 87 × 54 × 16 mm, small enough that it cannot interfere with subjects' normal sleep. There are two electrodes on the back side of the device. The sample frequency is 250 Hz and it can work continuously for 24 h. The storage is enough for 20 days' record. The experiment lasted for one night. All the staff were required to fill in the sleep questionnaire the next morning as soon as they woke up.
The subjective sleep questionnaire was modified from the widely used Pittsburgh sleep quality index (PSQI) questionnaire [16]. A few questions relevant to the professional characteristics of the special position staff were added to the questionnaire and the questions related to general conditions of the past month's sleep quality were replaced by the questions reflecting one night's sleep conditions or deleted. The total score of our questionnaire ranges from 0 to 22 and the lower the score, the better the sleep experience. There are 7 components in the scoring system, 6 components ranging from 0-3, and 1 component ranging from 0-4. The questionnaire and the scoring system of it are contained in Appendix A. It should be considered that the special position staff are young people, most of them sleep well in normal conditions and they never use medicines to help them sleep. After being assessed by experts, the score below or equal to 6 indicates "Good" and above 6 indicates "Poor" in the grading criterion. In our experiment, subjects were divided into 2 classes-"good sleep" and "poor sleep".
The study was approved by the Ethics Committee of Zhongda Hospital, affiliated to Southeast University, China. All the subjects were well informed of the aims and risks of the experiment and they all signed informed consent before the study.

Method
Sleep staging model building is the first step to implement sleep quality evaluation. The second step is to construct a sleep quality evaluation model as a bridge of objective sleep staging features and subjective sleep questionnaire scores.

Sleep Staging Model
The sleep staging model was built based on the international public dataset. Before model building, we divided the dataset into two parts: half of each database was the training set and the other half was the test set. For obvious absent data (data less than 5 h in length), data error, those with obvious sleep disease (whose Apnea-Hypopnea Index (AHI) is equal or more than 5), objective sleep efficiency less than 75%, and if the recording time began too early (over 45 min at the beginning is wake stage) or ending too late (over 30 min at the end of the data is wake stage), the subjects' data were discarded.

Data Preprocessing
The R waves of ECG signals were automatically identified by the P&T method for its extensively tested accuracy and efficiency [22]. According to the calculation rules of HRV parameters, abnormal heartbeats, such as wide gap longer than 2 s, narrow gap shorter than 0.35 s and irregular adjacent gaps, were removed automatically. If the wake time at the beginning of each subject's signal is longer than 5 min, the extra signal before the last 5 min in the wake time is cut off. The sleep stages were annotated every 30 s. For each 30-s data segment, the 4 min 30 s data centered on it is the basic unit for calculating HRV features; the segments with more than two kinds of sleep stages were not included.

1.
Candidate features that represent the characteristics of the data segment itself The sleep staging model is based on the transformation of different HRV features in the whole night. Each HRV feature is calculated every 4.5 min. Thus, to build the model we should first select proper HRV features and calculate their values. In our work, some frequently used features in research and clinical conditions and two-time irreversibility features were taken into consideration in the feature selection procedure. We examined how each feature and each combination of features correlate with the transformation of sleep stages by analysis of variance. The 23 features we picked out are listed in Table 1. HR, SDNN, RMSSD, SDSD, PNN50, LF, HF, VLF, LF/HF ratio, LFn and HFn are linear HRV features. Among them, feature coRR, resf and strf are cited from Heenam Yoon's work [23] and they are effective in the recognition of the REM stage. Feature α1 and α2 are the slope and offset of Detrended Fluctuation Analysis (DFA) respectively. The DFA method was first proposed by Peng et al. in 1994 when they studied the structure of DNA molecular chain [24]. DFA analysis first removes trends caused by external factors in the sequence, and then studies the long-range correlation of the sequence. DFA analysis can eliminate trend components of all orders, deal with those time series with long-range correlation and non-stationary data, and eliminate the pseudo-correlation phenomenon. P 1 and G 1 are time irreversibility parameters that measure time imbalance of a RR intervals sequence [25][26][27][28]. The difference value between adjacent RR intervals could be positive, negative and zero. The positive and negative ones are expressed by ∆RR + , ∆RR − and their number are shown as N(∆RR + ),N(∆RR − ). The two features are defined in Formulas (1) and (2).

RMSSD
The root mean square of successive differences.

SDSD
The standard deviation of the difference of the adjacent NN intervals.

PNN50
The number of times in which the change in successive normal sinus intervals exceeds 50 ms.
LF, HF VLF LF/HF ratio Low-frequency and high-frequency power.
Very low-frequency power. The ratio of low-frequency power to high-frequency power

LFn, HFn
The normalized low-frequency and high-frequency power.  (1) and (2) Other features corr2 If the length of a sequence is n, and the first and last n-2 points constitute sequences s 1 , s 2 respectively, corr2 is the coefficient of s 1 and s 2 .
If the length of a sequence is n, and the first and last n-2 points constitute sequences s 1 , s 2 respectively, corr2 is the coefficient of s 1 and s 2 .
HRV parameters vary with changes of sleep duration and sleep stages. If we plot the variation trend of a HRV parameter in a sliding window whose length is five minutes and step size is 30 s in a whole night's sleep, we can find out the regularities in how HRV parameters vary synchronously with the changing of sleep stages. Figure 1 is an example, demonstrating the synchronous relationship of the index α1 and SE with sleep stages. The number of times in which the change in successive normal sinus intervals exceeds 50 ms. LF, HF VLF LF/HF ratio Low-frequency and high-frequency power.
Very low-frequency power. The ratio of low-frequency power to high-frequency power LFn, HFn The normalized low-frequency and high-frequency power.
Features relevant to respiration coRR Coefficient of an RR interval sequence and its second-order smoothed sequence resf The dominant frequency in the range of 0.15 to 0.5 Hz strf The standard deviation of resf  (1) and (2) Other features corr2 If the length of a sequence is n, and the first and last n-2 points constitute sequences s1, s2 respectively, corr2 is the coefficient of s1 and s2.
HRV parameters vary with changes of sleep duration and sleep stages. If we plot the variation trend of a HRV parameter in a sliding window whose length is five minutes and step size is 30 s in a whole night's sleep, we can find out the regularities in how HRV parameters vary synchronously with the changing of sleep stages. Figure 1 is an example, demonstrating the synchronous relationship of the index ɑ1 and SE with sleep stages. 2. Candidate features that represent the relationship between current data unit and the adjacent/overnight data units Considering that the feature value saltus is often accompanied by the transformation of sleep stages, the relationship between each data unit to be classified and the adjacent or overnight data units has a great influence on the classification results. In the process of classifier construction, we need to introduce new features to reflect whether there is a  Candidate features that represent the relationship between current data unit and the adjacent/overnight data units Considering that the feature value saltus is often accompanied by the transformation of sleep stages, the relationship between each data unit to be classified and the adjacent or overnight data units has a great influence on the classification results. In the process of classifier construction, we need to introduce new features to reflect whether there is a trend rise or fall in HR and respiration rate between two adjacent data units, and compare the level of HR and respiration rate of this data unit with that within one hour and overnight. Based on this, the features we extracted are listed in Table 2. The rising and falling trend features are the slope value between the first and second maximum points (t1, t2) and the first and second minimum points (b1, b2) are given in 3 min windows before and after current data unit. Table 2. Candidate features that represent the relationship between current data unit and the adjacent/overnight data units.

HRm_m1, resm_m1
The ratio of average HR and respiration rate in current data unit to that in current one-hour.
HRm_ma, resm_ma. The ratio of average HR and respiration rate in current data unit to that in overnight data.

Useless features elimination
In the process of model building, some features made little difference on the improvement of sleep stage recognition accuracy or could be replaced by other features. Thus, further feature selection is needed. For the features in Table 1, we limited the number of features to 2 or 3 and exhausted all the feature combinations to examine the recognition effect of different sleep stages. Then, the feature combinations were ranked in the order of recognition accuracy and Youden's index. Youden's index is a method to evaluate the authenticity of the screening test [29]. The index is the average performance of sensitivity and specificity of a classification result. It gives equal weight to false positive and false negative values, so all tests with the same value of the index give the same proportion of total misclassified results.
Features frequently demonstrating relatively high accuracy and Youden's index were reserved and others were excluded. For feature combinations, we first computed the first principal component with principal component analysis (PCA) and it was smoothed at a small and a large frame length. An offset value was added on the large frame length smoothed curve. The detailed procedure is similar to building the recognizers of the model, which is introduced below in the section on "Steps of model construction". When trying to recognize one sleep stage, some features pull down the average value of the combination's performance and some have no remarkable effect. They are all useless for the recognition of a certain stage. Finally, we obtained four groups of "Useless features" and the features in their intersection are removed from the whole features set. This method reduces the number of features without losing the potential for the improvement of the model's performance.
For the features in Table 2, Support Vector Machine methods based on Recursive Feature Elimination (SVM-RFE) was applied to implement the feature selection for each classifier. SVM-RFE method was first proposed by Guyon when he studied the gene selection for cancer classification [30]. This method sorts the features in the order of their significance based on the score of each feature, which helps us eliminate those features that make little difference to the performance of the model and reserve those that are useful to obtain the best accuracy.

Principle of recognizers' construction
In our work, we first observe how the principal component of the selected features varies in the whole night relative to its smoothed tendency. We used Savitzky-Golay FIR filter as the smoothing filter. The relative position is an important basis for distinguishing different sleep stages. Our task is to look for feature combinations suitable for recognizing a specific sleep stage. To achieve this, we exhaust all the feature combinations, calculating their first principal component respectively, then transform them into their zero mean and unit variance. For a subject's whole-night data, the first principal component F 1 is smoothed at a small and a large frame length, named SF 1 and BF 1 . SF 1 follows the variation in F 1 more closely and BF 1 reflects the major transformation of sleep cycles in the night. Figure 2A is an example. F 1 is the first principal component of a combination of four HRV feature (coRR, HR, resf and α2). The relative position of SF 1 to BF 1 relates to how deep the sleep is. When SF 1 is above BF 1 and they are far from each other, a jump often appears in sleep stages from NREM sleep to wakefulness or REM sleep. Thus, we can recognize a specific sleep stage by the relative position of SF 1 to BF 1 . In order to make the recognition effect better, the BF 1 curve should be moved up or down a little bit. We used k to denote a small distance. We put the combination of four features (coRR, HR, resf, α2) and offset k together and called it the wakefulness recognizer. In our work, we first observe how the principal component of the selected featu varies in the whole night relative to its smoothed tendency. We used Savitzky-Golay filter as the smoothing filter. The relative position is an important basis for distinguish different sleep stages. Our task is to look for feature combinations suitable for recogniz a specific sleep stage. To achieve this, we exhaust all the feature combinations, calcula their first principal component respectively, then transform them into their zero mean unit variance. For a subject's whole-night data, the first principal component F smoothed at a small and a large frame length, named SF1 and BF1. SF1 follows the varia in F1 more closely and BF1 reflects the major transformation of sleep cycles in the ni Figure 2 (A) is an example. F1 is the first principal component of a combination of f HRV feature (coRR, HR, resf and ɑ2). The relative position of SF1 to BF1 relates to h deep the sleep is. When SF1 is above BF1 and they are far from each other, a jump o appears in sleep stages from NREM sleep to wakefulness or REM sleep. Thus, we recognize a specific sleep stage by the relative position of SF1 to BF1. In order to make recognition effect better, the BF1 curve should be moved up or down a little bit. We u k to denote a small distance. We put the combination of four features (coRR, HR, resf, and offset k together and called it the wakefulness recognizer. We apply the same procedures to look for the appropriate features combination k to construct the REM recognizer, light sleep (N1 and N2 included) recognizer and d sleep (N3 stage) recognizer respectively. For the recognition of wakefulness and R sleep, the criterion is SF1 > BF1+k, and for the recognition of deep sleep, the criterion is < BF1+k. Furthermore, BF1+k1 < SF1 < BF1+k2 is used to recognize the light sleep stage. criterion differs due to the characteristics of HRV features. When a subject is aroused f NREM sleep, a jump would occur from a relatively low value to a higher value in m We apply the same procedures to look for the appropriate features combination and k to construct the REM recognizer, light sleep (N1 and N2 included) recognizer and deep sleep (N3 stage) recognizer respectively. For the recognition of wakefulness and REM sleep, the criterion is SF 1 > BF 1 + k, and for the recognition of deep sleep, the criterion is SF 1 < BF 1 + k. Furthermore, BF 1 + k1 < SF 1 < BF 1 + k2 is used to recognize the light sleep stage. The criterion differs due to the characteristics of HRV features. When a subject is aroused from NREM sleep, a jump would occur from a relatively low value to a higher value in many major HRV features except SE and α2, whose variation regularity is often inverse to other features. Before PCA, we should reverse the two features to their opposite number in order to achieve better accuracy. Boundary value processing based on variable threshold and window width In the actual data analysis, if only BF 1 ± k is used as the threshold to identify a certain sleep stage, the data segments at the intersection of SF 1 and BF 1 ± k are likely to be misjudged. We use variable windows and local threshold adjustment to correct the classification of W, R and D segments. First, the windows in which the height difference and slope between maximum points and minimum points meet the threshold requirements are found. Taking W recognizer as an example, the window starts from the minimum point of SF 1 below BF 1 , along with the transformation of SF 1 from negative to positive, passing the maximum point of SF 1 above BF 1 + k, until SF 1 goes down through the intersection of BF 1 (as shown in Figure 3). We performed threshold adjustment at the medial side of the intersections of SF 1 and BF 1 + k. The threshold of W and R recognizer is lifted, and D recognizer is in the opposite direction. The width and height of the threshold adjustment is proportional to the slope and height difference between the high point and low point on the same side (H1&L1, H2&L2). major HRV features except SE and ɑ2, whose variation regularity is often inverse to other features. Before PCA, we should reverse the two features to their opposite number in order to achieve better accuracy.

Boundary value processing based on variable threshold and window width
In the actual data analysis, if only BF1±k is used as the threshold to identify a certain sleep stage, the data segments at the intersection of SF1 and BF1±k are likely to be misjudged. We use variable windows and local threshold adjustment to correct the classification of W, R and D segments. First, the windows in which the height difference and slope between maximum points and minimum points meet the threshold requirements are found. Taking W recognizer as an example, the window starts from the minimum point of SF1 below BF1, along with the transformation of SF1 from negative to positive, passing the maximum point of SF1 above BF1+k, until SF1 goes down through the intersection of BF1 (as shown in Figure 3). We performed threshold adjustment at the medial side of the intersections of SF1 and BF1+k. The threshold of W and R recognizer is lifted, and D recognizer is in the opposite direction. The width and height of the threshold adjustment is proportional to the slope and height difference between the high point and low point on the same side (H1&L1, H2&L2)

Determination method of the final category
To classify ECG segments into four groups, a multi-classifier fusion method was applied. We first pass every ECG segment into the four recognizers to see which categories each ECG segment belongs to, respectively. A problem may occur if an ECG segment belongs to more than one sleep stage, so it is necessary to consider their grouping carefully. Here, the sigmoid-fitting method proposed by Platt [31] was used. In this method, the output result of the standard SVM is fitted with the LR model (sigmoid function), and the original output value of the model is mapped to the probability value [0, 1]. Firstly, we use SVM to build the WR/N classifier, W/R classifier and L/D classifier. Depending on the type and number of labels that may be attached to each data segment, they are selectively passed through each of the three classifiers. The data segments selected by multiple recognizers are first sent to the WR/N classifier. The probabilities they belong to W+R and N classes are P(WR) and P(N) respectively. Then they are passed through W/R classifier and L/D classifier to get the conditional probability P(W|WR), P(R|WR), P(L|N) and P(D|N). Therefore, the probability that each data segment belongs to the four categories is the joint probability of the above two categories. The classification to which the maximum probability belongs is taken as the final judgment result of the data segment, that is, the category corresponds to the maximum value among P(WR)×P(W|WR), P(WR)×P(R|WR), P(N)×P(L|N) and P(N)×P(D|N). For data segments whose labels only contain W & R, or L & D, we do not calculate the probability of P(WR) or P(N).

3
Determination method of the final category To classify ECG segments into four groups, a multi-classifier fusion method was applied. We first pass every ECG segment into the four recognizers to see which categories each ECG segment belongs to, respectively. A problem may occur if an ECG segment belongs to more than one sleep stage, so it is necessary to consider their grouping carefully. Here, the sigmoid-fitting method proposed by Platt [31] was used. In this method, the output result of the standard SVM is fitted with the LR model (sigmoid function), and the original output value of the model is mapped to the probability value [0, 1]. Firstly, we use SVM to build the WR/N classifier, W/R classifier and L/D classifier. Depending on the type and number of labels that may be attached to each data segment, they are selectively passed through each of the three classifiers. The data segments selected by multiple recognizers are first sent to the WR/N classifier. The probabilities they belong to W + R and N classes are P(WR) and P(N) respectively. Then they are passed through W/R classifier and L/D classifier to get the conditional probability P(W|WR), P(R|WR), P(L|N) and P(D|N). Therefore, the probability that each data segment belongs to the four categories is the joint probability of the above two categories. The classification to which the maximum probability belongs is taken as the final judgment result of the data segment, that is, the category corresponds to the maximum value among P(WR) × P(W|WR), P(WR) × P(R|WR), P(N) × P(L|N) and P(N) × P(D|N). For data segments whose labels only contain W & R, or L & D, we do not calculate the probability of P(WR) or P(N).
The procedure to operate this classification algorithm is demonstrated in Figure 4. Every 4.5 min ECG data segment is first recognized by the four recognizers (wakefulness recognizer, REM stage recognizer, light sleep recognizer and deep sleep recognizer) which are abbreviated to W recognizer, R recognizer, L recognizer and D recognizer, respectively. Each recognizer is responsible for determining whether this RR intervals sequence belongs to its category. If yes, we label this RR intervals sequence with the tag of the recognizer (W, R, L and D). After being checked by all four recognizers, each RR intervals sequence is attached with tags of different number, ranging from zero to four. Based on the category and number of the tags, different strategies are applied. (1) For the RR intervals sequence with only one tag, they are already categorized. (2) For the RR intervals sequence with two tags, if the two tags are from recognizer Group A (W and R recognizer) or Group B (L and D recognizer), that is, the tags on the sequence are WR or LD, they are divided further by classifiers used to differentiate W and R, or L and D. If the two tags are from Group A and Group B respectively (WL, WD, RL, RD), they are decided by WR/N classifier. (3) For the RR intervals sequence with no tag or with three or four tags, they are first sent to WR/N classifier to obtain the probability values P 1WR and P 1N , and then sent to the W/R classifier and L/D classifier to obtain the conditional probabilities P 2W , P 2R , P 2L and P 2D . The final classification result corresponds to the maximum joint probability of the above two steps' probability. Every 4.5 min ECG data segment is first recognized by the four recognizers (wakefulness recognizer, REM stage recognizer, light sleep recognizer and deep sleep recognizer) which are abbreviated to W recognizer, R recognizer, L recognizer and D recognizer, respectively. Each recognizer is responsible for determining whether this RR intervals sequence belongs to its category. If yes, we label this RR intervals sequence with the tag of the recognizer (W, R, L and D). After being checked by all four recognizers, each RR intervals sequence is attached with tags of different number, ranging from zero to four. Based on the category and number of the tags, different strategies are applied. (1) For the RR intervals sequence with only one tag, they are already categorized. (2) For the RR intervals sequence with two tags, if the two tags are from recognizer Group A (W and R recognizer) or Group B (L and D recognizer), that is, the tags on the sequence are WR or LD, they are divided further by classifiers used to differentiate W and R, or L and D. If the two tags are from Group A and Group B respectively (WL, WD, RL, RD), they are decided by WR/N classifier. (3) For the RR intervals sequence with no tag or with three or four tags, they are first sent to WR/N classifier to obtain the probability values P1WR and P1N, and then sent to the W/R classifier and L/D classifier to obtain the conditional probabilities P2W, P2R, P2L and P2D. The final classification result corresponds to the maximum joint probability of the above two steps' probability. . Graphical representation of classification procedure. (Classification procedure: Every ECG segment is detected by the four recognizers in turn. The sequence will be labeled by "W", "R", "L" or "D" if it passes the correspondent recognizer. The tags on each sequence are unequal in number. For example, "WR" indicates the sequence passes the W recognizer and R recognizer at the same time. For those segments with two tags, the WR/N classifier, W/R classifier and L/D classifier perform the judge function to decide which category they belong to. For those segments with 0 or more than 2 tags, we determine the final classification result by calculating the probability that they belong to each category.). . Graphical representation of classification procedure. (Classification procedure: Every ECG segment is detected by the four recognizers in turn. The sequence will be labeled by "W", "R", "L" or "D" if it passes the correspondent recognizer. The tags on each sequence are unequal in number. For example, "WR" indicates the sequence passes the W recognizer and R recognizer at the same time. For those segments with two tags, the WR/N classifier, W/R classifier and L/D classifier perform the judge function to decide which category they belong to. For those segments with 0 or more than 2 tags, we determine the final classification result by calculating the probability that they belong to each category.).
After the classification procedure, the results could be modified to conform to the physiological laws of real sleep. The REM stages which emerge too early (in the first 80 min after sleep begins) should be replaced by wakefulness, and deep sleep in the half hour before waking up was modified to light sleep.

Model Performance Guarantee and Examination
Ten-fold cross validation was used in the training set when we built the model. The test set was divided randomly into five equal parts which were named K1 to K5 for the examination of the results balance. We used the accuracy and average F 1 -score to evaluate the performance of the model. F 1 -score is the harmonic mean of the precision and recall. Precision is also known as positive predictive value, and recall is also known as sensitivity in binary classification.

Data Preprocessing
The data sources for this model are the subjects' ECG data and their sleep questionnaires. The ECG data covering less than 5 h or with too much noise mixed in (noise occupying more than 25% of data) was discarded. If there is too much information omitted to calculate the total score, the questionnaire together with the ECG data will also be excluded. Finally, there were 113 subjects left. The data information is listed in Table 3. Half of them were categorized into the training set and the other half into the test set randomly. The subjects belonging to each group were distributed equally in the two sets.

Model Principle
We aimed to divide all subjects into two groups, "good sleep" and "poor sleep", according to objective sleep quality features. Subjective sleep score acts as a standard and verification.
From the sleep staging results, we mainly extracted features that can reflect sleep fragmentation, sleep depth and sleep duration. For the measurement of sleep fragmentation, we calculated the occurrence number of successive W stages and the intensity of sleep stage transitions. Here, the intensity of sleep stage transitions is defined as follows: First, we labelled W, R, and L/D stages 1, 0 and 1 respectively. Then, we calculated the absolute value of the difference of the two neighboring labels. Finally, we summed up all the absolute differences as the intensity of sleep stage transitions.
Other features include sleep efficiency, sleep and REM incubation, total sleep and different sleep stages duration, proportion of different sleep stages, wakefulness times, and the frequency of W and R labels after 6 h since the beginning (measurement of early awakening).
The SVM-RFE method was also used to conduct feature selection. We also adopted accuracy and F 1 -score as the standard of the model's performance and five-fold cross validation was used.

Feature Selection Result
From the features set, there are 13 HRV features left for the building of the recognizers and the features that represent the mutual relationship of data units are all essential in the construction of the classifiers. The 13 HRV features are HR, PNN50, LF, LF/HF ratio, coRR, resf, strf, α1, α2, SE, P 1 , G 1 and corr2. Variance analysis was implemented for every feature to prove their distinguishing ability for at least one sleep stage. The result suggests that every feature demonstrates a high significant difference for at least one sleep stage. Statistical difference of all the features in different sleep stages is demonstrated in Figure 5 using the mean ± standard deviation format. Each subfigure represents a specific feature, and we can clearly see that their mean value varies with the sleep stages. For example, the feature SE in D stage is distinctly larger than the other three stages, and α2 in L stage is much smaller than the other three stages. Successive two samples u test result shows the p values are all less than 0.001 which indicates that all features make a difference picking the four stages out from all ECG segments.  Figure  5 using the mean±standard deviation format. Each subfigure represents a specific feature, and we can clearly see that their mean value varies with the sleep stages. For example, the feature SE in D stage is distinctly larger than the other three stages, and ɑ2 in L stage is much smaller than the other three stages. Successive two samples u test result shows the p values are all less than 0.001 which indicates that all features make a difference picking the four stages out from all ECG segments. Experiment results showed that the combination of specific features which are suitable for the recognition of a certain sleep stage are not applicable for the recognition of other sleep stages. The combination of features that suit each recognizer best is listed in Table 4. Table 4. Feature combinations that suit each recognizer best.

Recognizer/Classifier
Features Combinations W recognizer coRR, HR, resf, ɑ2 R recognizer coRR, HR, resf, ɑ1, ɑ2, G1 Experiment results showed that the combination of specific features which are suitable for the recognition of a certain sleep stage are not applicable for the recognition of other sleep stages. The combination of features that suit each recognizer best is listed in Table 4. For the sleep quality evaluation model, nine features are left and they are sleep efficiency, total sleep time, the occurrence number of successive W stages, the intensity of sleep stage transitions, and the frequency of W and R labels after 6 h since the beginning, NREM and W duration, R and D proportion. The third-order polynomial kernel was the most fitting for the model by testing.

Performance of Sleep Staging Model
The algorithm has been tested in CincDB and SHHS1DB and the results of four classes' categorization (wakefulness, REM, light sleep and deep sleep) are listed in Table 5. There is not huge difference in accuracy and F 1 -score between the two databases. Accuracy is the average value of all data and F 1 -score is the mean value of different categories. We obtained an accuracy of 0.661 and an F 1 -score of 0.625 in CincDB. The accuracy and F 1 -scores in SHHS1DB were 0.659 and 0.624 respectively. The confusion matrixes are in Appendix B (Tables A1 and A2).

Performance of Sleep Quality Evaluation Model
The results of each fold are demonstrated in Table 6. The accuracy in the test set is 0.786 and the average F1-score is 0.771 at the same time.

Discussion
In this paper, we proposed a scheme that is easy to transplant to the common wearable ECG devices to conduct sleep quality evaluation and applied it to practical conditions. Our model has been tested by the international public database and medically approved sleep questionnaires involving over 100 subjects. In the model building procedure, a new method for sleep staging using single-lead wearable ECG signals has been developed and the mapping relationship between objective sleep parameters and subjective sleep questionnaires has been expressed by a SVM model.
In the sleep staging model building procedure, we balanced light weight and performance. We did not use too many features or a complex computing method. Researchers have pointed out the good coincidence of HR & pulse rate (PR), HRV & pulse rate variability (PRV) [13]; thus, we can also obtain pulse intervals as a replacement of cardiac intervals. The pulse acquisition devices are more easily accessible. Our scheme suggests a possible way of applying the cardiac cycle sequence to sleep quality. This is important for the routine testing of special post staff and ordinary persons.
We also tried our best to construct a model with a clear frame which provides a reasonable possibility for further modification. With the development of hardware power, complex machine learning and deep learning methods have become popular for their good performance. However, abstract neural networks are like a black box; it is unclear what is inside. This limits the promotion and revision of the model. Clarity and ease of understanding are the advantages of our model and this is conducive for model migration to populations of different ages or disease.
In the process of sleep stage model building, feature selection is a critical step. In the process of constructing recognizers and classifiers, further feature selection was implemented to achieve the best accuracy and Youden's index. Table 7 is an example of feature selection for recognizing the REM stage. The features combinations of high and low accuracy and Youden's index are listed and the features existing in the upper half of the table occupy Important positions in REM stage recognition; in contrast, the features appearing in the bottom half of the table do not play a key role. From Table 7 we can find that LF, HF, VLF, LFn, HFn, SD1, SD2, SDSD, SDNN, RMSSD pull down the accuracy and FE, SE have no remarkable effect, so they are useless to REM. This method helps reduce the feature quantity to 13. In Table 4, we notice that the feature combination of each recognizer is quite different. BF 1 is the large frame length smoothed filtering result reflecting the transformation cycle of the whole sleep. HRV features are measurement of sympathetic and parasympathetic nerve activities, which is quite different from EEG. Thus, the algorithm accuracy based on ECG signals performs more poorly than algorithms based on EEG signals, which are the most important part in PSG sleep staging. Table 8 lists the four classes' grouping performance in other researchers' work. Some used ECG signals only and some added respiratory inductance plethysmography signals to build the model. In addition, the data source, data processing and result evaluation methods they chose are not the same. It is hard to compare the performance of models only by ranking the accuracy strictly. Youden's index is a measurement of the balance and credibility of the experimental results. It is a combination of sensitivity and specificity. Sensitivity is the accuracy of the target group recognition and specificity is the ability to pick out non-target groups precisely. For every link in the process of model building, considerations of Youden's index are as important as target group accuracy. Ten-fold cross validation was used to deal with the potential over-fitting problem, although the phenomenon of over-fitting was not apparent regardless of whether the volume of training set was small or big because of the visualization of our model's design thinking and since every ECG segment is considered in the whole night sleep cycle rather than separately and unrelatedly tested by a machine learning model. The large frame length smoothed trend line BF 1 behaves as the comparative standard of the principal component of the combination of HRV parameters to substitute for an invariable threshold value in the model's design thinking. This is because the same value range may correspond to different sleep stages for a specific feature with sleep progressing. Despite all that, individual differences are inevitable and that is the main source of error and accuracy decline.
In addition, the performance of each recognizer is affected by the feature quantity, k value and the frame length of smoothing filter. We take W recognizer as an example to show how the accuracy correlates with feature quantity ( Figure 6). The accuracy with a certain number of features is the highest with the best feature combination and with suitable Youden's index.  Youden's index is a measurement of the balance and credibility of the experimental results. It is a combination of sensitivity and specificity. Sensitivity is the accuracy of the target group recognition and specificity is the ability to pick out non-target groups precisely. For every link in the process of model building, considerations of Youden's index are as important as target group accuracy. Ten-fold cross validation was used to deal with the potential over-fitting problem, although the phenomenon of over-fitting was not apparent regardless of whether the volume of training set was small or big because of the visualization of our model's design thinking and since every ECG segment is considered in the whole night sleep cycle rather than separately and unrelatedly tested by a machine learning model. The large frame length smoothed trend line BF1 behaves as the comparative standard of the principal component of the combination of HRV parameters to substitute for an invariable threshold value in the model's design thinking. This is because the same value range may correspond to different sleep stages for a specific feature with sleep progressing. Despite all that, individual differences are inevitable and that is the main source of error and accuracy decline.
In addition, the performance of each recognizer is affected by the feature quantity, k value and the frame length of smoothing filter. We take W recognizer as an example to show how the accuracy correlates with feature quantity (Figure 6). The accuracy with a certain number of features is the highest with the best feature combination and with suitable Youden's index.  Differences exist between subjective and objective sleep parameters. Sleep efficiency is the ratio of the total time one is asleep to the total time one is dedicated to sleep. Sleep incubation is the time interval between one beginning to try sleeping and one falling asleep. We take them as examples. We calculate the objective and subjective sleep efficiency of every subject and plot a scatter diagram shown in Figure 7. From Figure 7 we notice that the sleep efficiency of the majority of subjects is greater than 0.8, and objective sleep efficiency tends to be smaller than subjective sleep efficiency. Similar results for sleep incubation are demonstrated in Figure 8. This is in accord with [5], although our sleep staging results are not the accurate PSG results. Differences exist between subjective and objective sleep parameters. Sleep efficiency is the ratio of the total time one is asleep to the total time one is dedicated to sleep. Sleep incubation is the time interval between one beginning to try sleeping and one falling asleep. We take them as examples. We calculate the objective and subjective sleep efficiency of every subject and plot a scatter diagram shown in Figure 7. From Figure 7 we notice that the sleep efficiency of the majority of subjects is greater than 0.8, and objective sleep efficiency tends to be smaller than subjective sleep efficiency. Similar results for sleep incubation are demonstrated in Figure 8. This is in accord with [5], although our sleep staging results are not the accurate PSG results.  Differences exist between subjective and objective sleep parameters. Sleep efficiency is the ratio of the total time one is asleep to the total time one is dedicated to sleep. Sleep incubation is the time interval between one beginning to try sleeping and one falling asleep. We take them as examples. We calculate the objective and subjective sleep efficiency of every subject and plot a scatter diagram shown in Figure 7. From Figure 7 we notice that the sleep efficiency of the majority of subjects is greater than 0.8, and objective sleep efficiency tends to be smaller than subjective sleep efficiency. Similar results for sleep incubation are demonstrated in Figure 8. This is in accord with [5], although our sleep staging results are not the accurate PSG results.  With respect to the connection between subjective and objective sleep quality, there are studies revealing the correlation of subjective and objective sleep parameters with significance level p. However, only p is useless for the prediction work. No precise numerical connections between objective and subjective sleep quality parameters have been disclosed and we tried to understand the mapping relationship from sleep staging parameters to the questionnaire score using a machine learning method; however, apparent correspondence cannot be found, so we seek the second best outcome by just categorizing sleep quality into two classes and trying our best to meet practical demands.
There are still some limitations in this study, which should be addressed and improved in future studies. First, since most special position staff are currently male, we did not study gender differences in this work. This is a limitation of the work, and more female subjects need to be included in the experiment to explore the possible impact of gender differences to enhance the generalization of the model. Second, the dataset of this study excludes the data of most patients with sleep apnea. Sleep disorders and their sleep structures deserve more attention and further exploration. Night-to-night variability also needs to be considered in the model building. Third, since our research still needs to be improved on the precise discrimination of sleep stages by ECG, the sleep staging methods based on RR intervals can only be applied to daily home monitoring or rough assessment of sleep status of special positions staff, which is far from the requirements of clinical diagnosis. To bridge this gap, more basic studies on ECG are needed. Fourth, for practical applications, more real data for testing are needed. Fifth, cross-dataset validation is very important in improving the model's generalization ability. Without it, when the model is applied to other datasets, the reliability of the prediction results will be affected. Therefore, improving the cross-dataset performance of the model is one of the key problems in the future. Sixth, since we have no access to accurate sleep stage labeling of special position staff, we cannot figure out the exact relationship between sleep stages and sleep quality. Despite PSG interfering with sleep, such data are important to improve the performance of the sleep quality evaluation model. Finally, the applicability and migration of the algorithm between different populations and different diseases need to be modified and improved in practice. Pre-study of each person is a good method to overcome individual differences.

Conclusions
This paper aims at developing a sleep quality evaluation strategy with single-lead wearable cardiac interval signals, so that the method can be easily transplanted into portable electronic products. To achieve this, we first use PCA, alterable boundary and multipleclassifier fusion method to build a sleep staging model. After that, we established a sleep quality evaluation model with the sleeping time ECG data and sleep questionnaires from 200 persons. The SVM and SVM-RFE methods were used. Although some limitations exist, our model reached a decent performance with fewer features and an explicable frame structure. This provides the possibility for daily home use or routine testing in working conditions if PSG is not easily accessible. The future work is to optimize and improve the algorithm to suit different populations, ages and sleep diseases.

Scoring system
The items in Sleep Log Questionnaire are combined to form 7 "component" scores, each of which has a range of O-3 or 0-4 points. In all cases, a score of "0" indicates no difficulty, while a score of "3" or "4" indicates severe difficulty. The 7 component scores are then added to yield one "global" score, with a range of O-22 points, "0" indicating no difficulty and "22" indicating severe difficulties in all areas. Component 1: Subjective sleep quality Examine question 3c, and assign scores "0", "1", "2", "3" to the options " 1 ", " 2 ", " 3 ", " 4 " respectively. Component 2: Sleep latency Examine questions 1b and 1c, and calculate their time difference. Assign scores as follows: