Wrist Photoplethysmography Signal Quality Assessment for Reliable Heart Rate Estimate and Morphological Analysis

Photoplethysmographic (PPG) signals are mainly employed for heart rate estimation but are also fascinating candidates in the search for cardiovascular biomarkers. However, their high susceptibility to motion artifacts can lower their morphological quality and, hence, affect the reliability of the extracted information. Low reliability is particularly relevant when signals are recorded in a real-world context, during daily life activities. We aim to develop two classifiers to identify PPG pulses suitable for heart rate estimation (Basic-quality classifier) and morphological analysis (High-quality classifier). We collected wrist PPG data from 31 participants over a 24 h period. We defined four activity ranges based on accelerometer data and randomly selected an equal number of PPG pulses from each range to train and test the classifiers. Independent raters labeled the pulses into three quality levels. Nineteen features, including nine novel features, were extracted from PPG pulses and accelerometer signals. We conducted ten-fold cross-validation on the training set (70%) to optimize hyperparameters of five machine learning algorithms and a neural network, and the remaining 30% was used to test the algorithms. Performances were evaluated using the full features and a reduced set, obtained downstream of feature selection methods. Best performances for both Basic- and High-quality classifiers were achieved using a Support Vector Machine (Acc: 0.96 and 0.97, respectively). Both classifiers outperformed comparable state-of-the-art classifiers. Implementing automatic signal quality assessment methods is essential to improve the reliability of PPG parameters and broaden their applicability in a real-world context.


Introduction
Wearable devices (WDs) are among the most widespread technologies introduced in recent years [1], potentially revolutionizing healthcare. With the aging population and the higher incidence of chronic diseases [2,3], there is a growing need to provide healthcare services capable of reaching people who require frequent medical check-ups, especially those with low mobility and who live in remote areas. With their compact dimensions, high portability, and low manufacturing cost, WDs can efficiently perform long-term recordings outside healthcare facilities, allowing for the remote, continuous monitoring of a user's health and, in turn, the early detection of anomalies [4,5].
Commonly embedded in commercial smartwatches and fitness trackers worn at the wrist, one of the most used WD technologies is photoplethysmography (PPG), an optical technique that detects blood volume changes using a light source and a matched photodetector. The former illuminates a portion of the body surface, penetrating the skin and reaching the blood vessels. The latter detects the changes (using reflected or transmitted light, depending on the PPG sensor design [6]) modulated by the pulsatile blood flow, which depends on the heartbeat, vessel stiffness, and respiratory rate [7]. The PPG signal presents a quasi-periodic stereotyped waveform, the PPG pulse, which occurs with each heartbeat [8]. Each PPG pulse can be divided into two phases: the anacrotic phase, which relates to the systolic heart contraction, and the catacrotic phase, which depends both on the diastolic heart phase and on the wave reflected from the peripheral artery [9]. Within each PPG pulse, in ideal conditions, four fiducial points can be identified, as highlighted in Figure 1:
• Systolic foot: the beginning of the systolic phase and the minimum of the pulse;
• Systolic peak: the most prominent maximum;
• Dicrotic notch: most visible in healthy young subjects, it is supposed to represent the closure of the aortic valve [10];
• Diastolic peak: the second prominent maximum of the pulse.
The PPG signal is strictly related to heart dynamics. Indeed, it is extensively used in commercial devices for heart rate (HR) estimation [3,11] and subsequent HR variability (HRV) analysis [12,13]. For example, HR can be estimated simply by detecting the systolic foot or peak, calculating the time difference between two consecutive occurrences, and then calculating the ratio between 60 and the calculated time difference, expressing it in beats/min [14,15].
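The HR calculation described above (60 divided by the time between two consecutive fiducial points) can be sketched as follows. This is a minimal illustration, not the procedure used by any specific device; the function name `heart_rate_bpm` and the example peak times are hypothetical.

```python
def heart_rate_bpm(fiducial_times_s):
    """Instantaneous heart rate (beats/min) from consecutive fiducial-point times.

    fiducial_times_s: times in seconds of the same fiducial point
    (systolic foot or systolic peak) in consecutive pulses.
    """
    rates = []
    for earlier, later in zip(fiducial_times_s, fiducial_times_s[1:]):
        interval_s = later - earlier
        if interval_s <= 0:
            raise ValueError("fiducial times must be strictly increasing")
        rates.append(60.0 / interval_s)  # ratio between 60 and the interval
    return rates


# Hypothetical systolic-peak times (seconds), for illustration only.
example_peaks = [0.0, 0.8, 1.6, 2.6]
example_rates = heart_rate_bpm(example_peaks)
```

An interval of 0.8 s corresponds to about 75 beats/min, and an interval of 1 s to about 60 beats/min.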
Besides the HR estimation, it has long been recognized that the PPG signal carries valuable information in its morphology [16]. Recent research has corroborated this finding in emotion recognition [17][18][19] and cardiovascular measurements [20,21].
In real-world applications, the preferred ground for PPG technology, obtaining reliable estimates of both HR and morphological features is hampered by the signal's high susceptibility to external noise and motion artifacts [22,23]. Consequently, the information above cannot be used in clinical practice for diagnostic purposes. Before further processing, a signal quality analysis is essential to promote this signal's clinical use.
Based on the definitions provided by the recent literature [2,24], the quality of a PPG pulse exploitable for further analysis can be expressed as:
• Basic-quality pulse: systolic peaks are clearly identifiable;
• High-quality pulse: the pulse waveform is clean and well-defined, with systolic and diastolic waves visible.
While HR and some morphological features related to detecting the systolic peak can be estimated from Basic-quality pulses, more sophisticated morphological features require the detection of both systolic and diastolic peaks [25][26][27], so only High-quality pulses are suitable.
However, most previous studies only aim to detect PPG pulses for HR estimation, without rating their suitability for a more in-depth morphological analysis [31][32][33][34][35][36][37]. Moreover, some base the quality estimation on a time window that includes several pulses [24,28,29,31,34,36,37] rather than on a pulse-wise analysis, thereby losing relevant information that individual PPG pulses can convey. Such a segment-wise analysis might also discard pulses suitable for analysis.
Although the publicly available datasets represent a considerable resource for training and testing automatic classifiers, they do not allow for a proper quality characterization for real-world purposes. To the best of our knowledge, most of the currently available datasets are based on recordings of finger PPG signals in a clinical context, imposing several limitations. Since it is well-known that the morphology strongly depends on the measurement site [10,38], the translation of a method based on signals recorded at the finger to signals recorded at the wrist (the preferred measurement site for real-world applications) is not feasible. Furthermore, the available datasets do not provide any ground truth information about the different quality of the signals (i.e., Basic and High), but only dichotomous labels (e.g., usable vs. non-usable). Finally, these datasets rely on hospital recordings, a context in which motion artifacts are far less frequent and less impactful than in the real world during daily life activities.
Recent works used PPG signals recorded by wrist-worn WDs in a real-world context and collected PPG pulses prone to lifelike motion artifacts [29][30][31] to overcome these limitations. Unfortunately, in these studies, no information is provided about the motion of the sensors, so it is unclear to what degree the related method is robust to daily life motion artifacts.
This work aimed to develop two motion-aware classifiers:
• Basic-quality classifier: it detects all pulses with valid information content, exploitable for heart rate estimation, and the extraction of basic morphological features;
• High-quality classifier: it detects all pulses with distinct systolic and diastolic waves, exploitable for the extraction of more in-depth morphological features.
We collected wrist PPG data for about 24 h to design and test our classifiers in a real-world context. First, we defined different activity ranges to categorize the level of motor activity, which translates into motion artifacts in the PPG signals. Activity ranges were identified based on data from the accelerometer embedded in the same wrist-worn WD used to record the PPG signal. Then, for each range from each subject, we randomly selected PPG pulses to be classified. In this way, the classifiers could be trained using data subjected to different levels of motion artifacts, usually experienced in real-world contexts.
Such an approach could help in improving the reliability of the valuable biomarkers obtained by wrist PPG signals, minimizing the loss of information by conducting a pulse-wise analysis and selecting pulses suited for a specific analysis (i.e., HRV and fundamental morphological analysis or a more in-depth morphological analysis).

Wearable Device
An Empatica E4 [39] wristband was used to record the signals. The E4 is a CE medical-grade device that allows for the continuous, simultaneous recording of several physiological signals, including PPG and accelerometer data. The PPG sensor is equipped with four light sources (two green, two red) and two photodetectors; the signal is sampled at a frequency of 64 Hz. The tri-axial accelerometer has a range of ±2 g and is sampled at 32 Hz.

Participants
A total of 31 recordings by as many participants were used. All the subjects were instructed to wear the Empatica E4 for 24 h while carrying on with their normal daily activities. The participants were asked to provide their age and gender; other personal information was not collected.

PPG Preprocessing and Pulse Detection
A second-order Butterworth band-pass filter with cut-off frequencies of 0.5 and 12 Hz was applied to each PPG recording [31]. The algorithm by Elgendi et al. [40], originally developed to detect second derivative PPG fiducial points, was adapted to detect the systolic peak and systolic foot of each pulse to segment the signal into single pulses. Each pulse was then normalized with the z-score procedure:
$$z_i = \frac{x_i - \mu}{\sigma}$$
where x_i is the i-th sample of the pulse, and μ and σ are the mean and standard deviation of the pulse samples.
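The filtering and normalization steps can be sketched in Python as follows. This is an illustrative sketch rather than the authors' Matlab code. Several choices are assumptions not stated in the text: zero-phase filtering with `sosfiltfilt`, passing `order=2` directly to SciPy's `butter` (which for a band-pass design yields a filter of twice that order), and the sample standard deviation (`ddof=1`) in the z-score. The function names are hypothetical.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

FS_PPG = 64.0  # Hz, PPG sampling frequency stated in the text


def bandpass_ppg(ppg, fs=FS_PPG, low_hz=0.5, high_hz=12.0, order=2):
    """Butterworth band-pass filter (0.5-12 Hz by default).

    Assumption: zero-phase filtering; the text does not say whether the
    filter was applied forward only or forward and backward.
    """
    sos = butter(order, [low_hz, high_hz], btype="bandpass", fs=fs, output="sos")
    return sosfiltfilt(sos, np.asarray(ppg, dtype=float))


def zscore_pulse(pulse):
    """Z-score one segmented pulse: subtract its mean, divide by its SD.

    Assumption: sample standard deviation (ddof=1).
    """
    pulse = np.asarray(pulse, dtype=float)
    sd = pulse.std(ddof=1)
    if sd == 0:
        raise ValueError("a constant pulse cannot be z-scored")
    return (pulse - pulse.mean()) / sd
```

Pulse segmentation (the adapted algorithm of Elgendi et al.) is not reproduced here; `zscore_pulse` expects a pulse that has already been cut from the filtered signal.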

Activity Index and Definition of Activity Ranges
To categorize pulses according to different amounts of movement, the activity index (A_ind) presented in [41] was calculated for each pulse. To this aim, each accelerometer (ACC) component (x, y, z) was resampled at f_s^{ACC-RES} = 64 Hz with linear interpolation (to match the PPG sampling frequency) and converted to g units. Next, a fourth-order band-pass filter was applied, with cut-off frequencies of 0.025 and 10 Hz [42,43]. The ACC vector magnitude was then calculated for each sample j as:
$$A_j = \sqrt{x_j^2 + y_j^2 + z_j^2}$$
The A_ind was estimated using the algorithm of Lin et al. [41]:
• Standard deviation of A_j for 5-second epochs:
$$\sigma_k = \mathrm{SD}\left(\{A_j : j \in \text{epoch } k\}\right)$$
• Sum of the standard deviations over M consecutive epochs:
$$A_{ind} = \sum_{k=1}^{M} \sigma_k$$
where M is set to 12 to obtain a minute-wise A_ind by summing 12 5-second epochs.
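As a rough sketch of the activity index computation, the code below takes accelerometer data that are already resampled to 64 Hz, converted to g units, and band-pass filtered, then computes the vector magnitude, the standard deviation per 5-second epoch, and the sum over 12 epochs. The function name and the use of the sample standard deviation (`ddof=1`) are assumptions; the exact normalization used by Lin et al. is not given in the text.

```python
import numpy as np


def activity_index(acc_xyz_g, fs=64.0, epoch_s=5.0, epochs_per_window=12):
    """Minute-wise activity index from tri-axial acceleration in g units.

    acc_xyz_g: array of shape (n_samples, 3), assumed already resampled
    and filtered. Incomplete trailing epochs and windows are dropped.
    """
    acc = np.asarray(acc_xyz_g, dtype=float)
    magnitude = np.sqrt((acc ** 2).sum(axis=1))          # A_j per sample
    n_per_epoch = int(round(fs * epoch_s))               # 320 samples at 64 Hz
    n_epochs = magnitude.size // n_per_epoch
    epochs = magnitude[: n_epochs * n_per_epoch].reshape(n_epochs, n_per_epoch)
    epoch_sd = epochs.std(axis=1, ddof=1)                # SD per 5-second epoch
    n_windows = n_epochs // epochs_per_window
    windows = epoch_sd[: n_windows * epochs_per_window]
    return windows.reshape(n_windows, epochs_per_window).sum(axis=1)
```

With the default arguments, two minutes of data (7680 samples) yield two activity index values, and a perfectly still sensor yields zeros.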
Once we estimated the A_ind for each recording, we defined four activity ranges (AR) based on the quartiles of all the A_ind values to label an equal number of pulses in each activity range.
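The quartile-based definition of the four activity ranges can be illustrated as follows. This is only a sketch: the function name is hypothetical, and the way values falling exactly on a quartile boundary are assigned is an assumption.

```python
import numpy as np


def activity_ranges(a_ind_values):
    """Assign each activity index value to one of four ranges (0-3).

    The range edges are the quartiles of the pooled values.
    Assumption: a value equal to an edge goes to the upper range.
    """
    a = np.asarray(a_ind_values, dtype=float)
    edges = np.percentile(a, [25, 50, 75])
    return np.digitize(a, edges), edges


# Hypothetical activity index values pooled from all recordings.
ranges, edges = activity_ranges([1, 2, 3, 4, 5, 6, 7, 8])
```

Because the edges are quartiles of the pooled distribution, each range contains roughly the same number of values.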

Labelling Procedure
Within each recording, we randomly selected a subset of 100 PPG pulses from each activity range, thus obtaining 400 pulses for each recording (12,400 labelled pulses in total). Three independent raters (S.M., S.L.G., and G.M.) then assigned a quality level to each pulse, selecting from one of the three levels defined below [2]:
• Bad (B): systolic and diastolic peaks cannot be easily distinguished from noise → the pulse is not suitable for further analysis.
• Fair (F): the systolic peak is clearly detectable; the diastolic peak is not → it is possible to estimate the heart rate and some basic morphological features.
• Excellent (E): systolic and diastolic peaks are both clearly detectable → it is possible to estimate the heart rate and basic morphological features, and to perform an in-depth morphological analysis.
An example of the three quality levels is illustrated in Figure 2. A Matlab graphical user interface was developed to help the raters annotate the quality of the selected pulses, as shown in Figure 3. The Matlab findpeaks function was applied to highlight the local maxima of the selected pulse and help detect the systolic and diastolic peaks.
Inter-rater agreement was assessed by calculating the overall Fleiss Kappa Score [44]. A majority voting approach was applied to determine the level if only two raters agreed. If there was no agreement among raters (i.e., each rater chose a different quality level), the pulse was automatically labelled as B.
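The labelling rule and the agreement statistic can be sketched as follows. The majority vote with a fallback to B follows the text; the Fleiss kappa function implements the standard textbook formula and is not the authors' code. The function names are hypothetical.

```python
from collections import Counter

LEVELS = ("B", "F", "E")


def consensus_label(ratings):
    """Majority vote among three raters; full disagreement falls back to 'B'."""
    label, votes = Counter(ratings).most_common(1)[0]
    return label if votes >= 2 else "B"


def fleiss_kappa(all_ratings, levels=LEVELS):
    """Fleiss kappa for items rated by the same number of raters.

    all_ratings: one tuple of rater labels per item.
    Undefined (division by zero) if every rating falls in one level.
    """
    n_items = len(all_ratings)
    n_raters = len(all_ratings[0])
    counts = [[r.count(level) for level in levels] for r in all_ratings]
    # Observed agreement per item, then averaged over items.
    p_items = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(p_items) / n_items
    # Chance agreement from the overall proportion of each level.
    p_levels = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(len(levels))
    ]
    p_chance = sum(p * p for p in p_levels)
    return (p_observed - p_chance) / (1 - p_chance)
```

For example, `consensus_label(("E", "E", "F"))` returns `"E"`, while `consensus_label(("B", "F", "E"))` returns `"B"`.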

Signal Quality Indices
We estimated nineteen signal quality indices (SQIs), listed in Table 2, corresponding to the selected and labelled pulses recorded in a real-world context. Specifically, we estimated:
• 2 SQIs from accelerometer data;
• 17 SQIs from PPG pulses.
We estimated the computational complexity of each feature in terms of floating-point operations (FLOPs) by using the Matlab package developed by Qian [45].
Labelled PPG pulses were divided into training and test sets, with a proportion of 70% for the training set (22 subjects; 8800 pulses) and 30% for the test set (9 subjects; 3600 pulses).
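The subject and pulse counts above imply a split by subject, so that no participant contributes pulses to both sets. A minimal sketch of such a split is shown below. How the nine test subjects were actually chosen is not stated, so the random draw, the seed, and the function name are assumptions.

```python
import random


def split_by_subject(subject_ids, n_test_subjects, seed=0):
    """Split pulse indices so that each subject is in exactly one set.

    subject_ids: one subject identifier per labelled pulse.
    Assumption: test subjects are drawn at random.
    """
    subjects = sorted(set(subject_ids))
    rng = random.Random(seed)
    test_subjects = set(rng.sample(subjects, n_test_subjects))
    train_idx = [i for i, s in enumerate(subject_ids) if s not in test_subjects]
    test_idx = [i for i, s in enumerate(subject_ids) if s in test_subjects]
    return train_idx, test_idx


# 31 subjects with 400 labelled pulses each, as in the text.
ids = [subject for subject in range(31) for _ in range(400)]
train_idx, test_idx = split_by_subject(ids, n_test_subjects=9)
```

With 9 of the 31 subjects held out, the sets contain 8800 and 3600 pulses, matching the counts reported in the text.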
SQIs from the training and test set pulses were then separately subjected to a Box-Cox transformation [46] and z-scored.
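A possible Python counterpart of this normalization step, applied to one SQI at a time, is sketched below. It uses `scipy.stats.boxcox`, which estimates the transformation parameter by maximum likelihood and requires strictly positive input. How non-positive SQI values were handled is not stated in the text, so the sketch simply rejects them. The function name and the sample standard deviation are assumptions.

```python
import numpy as np
from scipy.stats import boxcox


def boxcox_zscore(feature_values):
    """Box-Cox transform one SQI, then z-score the transformed values.

    Returns the normalized values and the estimated Box-Cox lambda.
    Assumption: strictly positive input and sample SD (ddof=1).
    """
    x = np.asarray(feature_values, dtype=float)
    if np.any(x <= 0):
        raise ValueError("Box-Cox requires strictly positive values")
    transformed, lam = boxcox(x)
    z = (transformed - transformed.mean()) / transformed.std(ddof=1)
    return z, lam
```

Following the text, the function would be called separately on the training and test values of each SQI.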

SQIs Selection
To limit the use of redundant SQIs, we applied a Neighborhood Component Analysis (NCA) separately for the two classifiers. NCA is a non-parametric method for selecting features to maximize a classifier's accuracy [47]. As output, NCA provides a weight for each feature: the higher the weight, the more influential the feature is for solving the classification problem. We first tuned the NCA regularization parameter λ using ten-fold cross-validation on the training set to find the value that minimizes the classification loss. We then flagged the features whose weight was greater than 20% of the maximum weight. To make the selected feature set more robust, we ran the NCA ten times and retained the features that were flagged in at least 80% of the runs.
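The thresholding and voting rule applied to the NCA weights can be sketched as follows. The NCA itself is not reproduced; the sketch assumes that the feature weights of each run are already available, and the function name and example weights are hypothetical.

```python
def stable_features(weight_runs, rel_threshold=0.20, min_fraction=0.80):
    """Indices of features flagged in at least min_fraction of the runs.

    weight_runs: one list of feature weights per NCA run.
    A feature is flagged in a run when its weight exceeds
    rel_threshold times the maximum weight of that run.
    """
    n_runs = len(weight_runs)
    hits = [0] * len(weight_runs[0])
    for weights in weight_runs:
        cutoff = rel_threshold * max(weights)
        for j, weight in enumerate(weights):
            if weight > cutoff:
                hits[j] += 1
    return [j for j, h in enumerate(hits) if h / n_runs >= min_fraction]


# Hypothetical weights for three features over ten runs.
runs = [[1.0, 0.5, 0.5]] * 7 + [[1.0, 0.5, 0.1]] + [[1.0, 0.1, 0.1]] * 2
selected = stable_features(runs)
```

In this toy example, feature 0 is flagged in all ten runs and feature 1 in eight, so both are retained; feature 2 is flagged in only seven runs and is dropped.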

Basic- and High-Quality Classifiers
We designed the following classifiers:
• Basic-quality (BQ) classifier: it detects those pulses that can be used to estimate heart rate and for basic morphological analysis (i.e., the union of F and E pulses);
• High-quality (HQ) classifier: it detects those pulses that can be used for in-depth morphological analysis (i.e., E pulses).
To develop the HQ classifier, we investigated two alternative strategies:
1. Discern the union of B and F pulses against E pulses through a single-stage approach;
2. Discern between F and E pulses downstream of a BQ classifier through a multi-stage approach.
A scheme illustrating the two strategies and the related classifiers is shown in Figure 4. In summary:
• The BQ classifier is trained to detect the F&E classes against the B class;
• The Type 1 HQ classifier (HQ1) is independent of BQ and is trained to detect the E class against the B&F class (Figure 4, panel A);
• The Type 2 HQ classifier (HQ2) is trained to detect the E class against the F class, having as an input the pulses selected by the BQ classifier (Figure 4, panel B).
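The three classification problems differ only in how the B, F, and E labels are mapped to binary targets. The sketch below shows that mapping, together with the two-stage decision used by the multi-stage strategy. The function names are hypothetical, and `bq_predict` and `hq2_predict` stand for any trained binary classifiers.

```python
def binary_targets(labels):
    """Binary targets (1 = eligible pulse) for the three classifiers."""
    bq = [int(label in ("F", "E")) for label in labels]       # F&E vs. B
    hq1 = [int(label == "E") for label in labels]             # E vs. B&F
    hq2 = [int(label == "E") for label in labels if label != "B"]  # E vs. F
    return bq, hq1, hq2


def cascade_predict(features, bq_predict, hq2_predict):
    """Multi-stage strategy: HQ2 only sees pulses accepted by BQ."""
    if not bq_predict(features):
        return "B"
    return "E" if hq2_predict(features) else "F"
```

Note that the HQ2 target list is shorter than the other two, because B pulses are excluded from that problem.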
We first split the dataset into training (70%) and test (30%) sets both for BQ and HQ classifiers. We then conducted a ten-fold cross-validation on the training set with five machine learning (ML) algorithms (Tree, Naïve Bayes, Support Vector Machine, k-nearest neighbors, and Ensemble) and a neural network (NN), optimizing the hyperparameters through Bayesian optimization with 30 iterations. Finally, we trained and tested the classifiers with the full feature set and with the selected SQIs only.
We computed the following performance metrics on unseen data coming from the test set relative to the detection of eligible pulses (F&E pulses for the BQ classifier, E for HQ classifiers): area under the ROC curve (AUC), accuracy, sensitivity, specificity, precision, Matthews correlation coefficient (MCC), F1 score, and Cohen's kappa (κ).
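Apart from the AUC, which requires continuous classifier scores, the listed metrics follow from the four entries of the confusion matrix. The sketch below uses the standard definitions; the function name is hypothetical, and no guard against empty classes is included.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Standard metrics from confusion-matrix counts (positive = eligible pulse)."""
    n = tp + fp + tn + fn
    accuracy = (tp + tn) / n
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Agreement expected by chance, used by Cohen's kappa.
    p_chance = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n ** 2
    return {
        "accuracy": accuracy,
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),
        "f1": 2 * tp / (2 * tp + fp + fn),
        "mcc": (tp * tn - fp * fn) / mcc_den,
        "kappa": (accuracy - p_chance) / (1 - p_chance),
    }


# Hypothetical counts, for illustration only.
metrics = binary_metrics(tp=90, fp=20, tn=80, fn=10)
```

With these hypothetical counts, the sensitivity is 0.90, the specificity 0.80, and the accuracy 0.85.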
All the methods were implemented in Matlab 2021b. The whole signal processing and classification pipeline is illustrated in Figure 5.

State-of-the-Art Classifiers
We selected and adapted two classifiers from the literature to establish a benchmark for the performance of our classifiers.
(i) Jang et al. [30] proposed two classifiers based on the signal similarity between adjacent PPG pulses, a parameter also used in our work (SigSim). Their study identified three quality levels (i.e., good, moderate, and low) based on detecting the PPG pulse second derivative's fiducial points [8]. Then, two dichotomous classifiers, conservative and non-conservative, were developed. The former compares the good-quality level pulses against the merge of moderate- and low-quality level pulses, while the latter compares the good- and moderate-quality level pulses against low-quality level pulses. Each classifier is based on a fixed threshold, determined using the equal training sensitivity and specificity criterion [48], meaning that the optimal threshold is obtained by minimizing the difference between sensitivity and specificity. Jang et al.'s non-conservative classifier is analogous to our BQ classifier, and their conservative classifier is analogous to both our HQ1 and HQ2 classifiers.
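The equal sensitivity and specificity criterion can be illustrated with a simple search over candidate thresholds. This is a sketch of the general idea, not Jang et al.'s implementation. It assumes that a pulse is accepted when its similarity score is greater than or equal to the threshold, and that the candidates are the observed scores. The function name and example data are hypothetical.

```python
def equal_sens_spec_threshold(scores, labels):
    """Threshold minimizing |sensitivity - specificity| on training data.

    scores: one similarity score per pulse (higher = better quality).
    labels: 1 for eligible pulses, 0 otherwise.
    Ties are resolved in favor of the lowest threshold.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes are required")
    best_thr, best_gap = None, float("inf")
    for thr in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= thr)
        tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < thr)
        gap = abs(tp / n_pos - tn / n_neg)
        if gap < best_gap:
            best_thr, best_gap = thr, gap
    return best_thr


# Hypothetical similarity scores and labels.
thr = equal_sens_spec_threshold([0.1, 0.5, 0.7, 0.6, 0.8, 0.9], [0, 0, 0, 1, 1, 1])
```

In this toy example the selected threshold is 0.7, where sensitivity and specificity are both 2/3.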
(ii) The classifier proposed by Elgendi [24] is built on a Support Vector Machine that classifies 60-second PPG segments as belonging to one of three quality levels (i.e., excellent, acceptable, or unfit for diagnosis) based on the skewness property of the segment. We adapted this method to perform a pulse-wise analysis. Furthermore, since no information regarding the hyperparameters was reported, we applied the same approach described in Section 2.8 to find the best hyperparameter combination.
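In a pulse-wise adaptation, the skewness is computed on the samples of a single pulse rather than on a 60-second segment. A minimal sketch is shown below; it uses the population (biased) skewness, which is an assumption, since the exact estimator is not stated in the text. The function name is hypothetical.

```python
def skewness(samples):
    """Population skewness: third central moment over SD cubed."""
    n = len(samples)
    mean = sum(samples) / n
    m2 = sum((x - mean) ** 2 for x in samples) / n
    m3 = sum((x - mean) ** 3 for x in samples) / n
    if m2 == 0:
        raise ValueError("skewness is undefined for constant samples")
    return m3 / m2 ** 1.5
```

A symmetric set of samples has zero skewness, while samples with a long right tail have positive skewness.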

Experimental Data
We obtained real-world recordings of physiological signals from 31 subjects (15 males, 16 females), with a mean age of 37 years (±14) and an average recording length of 26:50 h (±05:51). All subjects were Caucasian, except for one African subject.

Activity Ranges
From the A_ind values estimated from the accelerometer signal, we obtained the following AR built on the quartile values of the A_ind distribution:
According to the classification proposed by Lin et al. [41], the activity ranges 0-3 correspond to rest/sleep, rest/sleep/sedentary, light, and light/moderate activity, respectively. This means that the distribution of A_ind is skewed towards lower activity levels in our population.

Labelling Results
A total of 12,400 pulses were labelled by three independent raters, who agreed on 86% of the labels. Only 57 pulses (0.46%) were labelled differently by each rater and hence relegated to the B category. Overall, the inter-rater agreement was high, with a Fleiss Kappa Score of 0.84, representing almost perfect agreement according to Landis and Koch [49]. Using a majority voting approach, we set the final labels to train and test the classifiers: 5962 B pulses (48.08%), 4612 F pulses (37.19%), and 1826 E pulses (14.73%). The overall distribution of the three quality levels among the four activity ranges is shown in Figure 6. As expected, as the A_ind (the amount of movement) increases, the percentage of B pulses gets higher, and the percentage of F and E pulses gets lower.

SQIs Selection
Considering N, the pulse length, the computational complexity to calculate the 19 features is approximately 37*N FLOPs. The computational complexity for each feature is reported in Supplementary Materials, Table S1.
We conducted SQIs selection separately for the BQ, HQ1, and HQ2 classifiers. In Table 3, the best λ values and their respective minimum classification loss values are reported for the three classifiers.  Tables S2, S3, and S4 for the BQ, HQ1, and HQ2 classifiers, respectively.


Basic-Quality Classifiers
A total of 5962 pulses belong to the B class (4260 used in the training set and 1702 in the test set), while 6438 pulses belong to the F&E class (4540 used in the training set and 1898 in the test set). Table 4 presents the performances of the BQ classifiers on the test set. The best method using the full features set is the SVM with a Quadratic kernel, reaching an accuracy of 0.9606 and a well-balanced sensitivity (0.9603) and specificity (0.9547). On the other hand, the GentleBoost Ensemble reached the best performance among the methods trained and tested with the selected SQIs, with slightly lower values for accuracy (0.9536) and sensitivity (0.9384) but specificity (0.9706) higher than the best method using the full features set. Final hyperparameters are reported in Supplementary Materials, Table S5. Concerning the state-of-the-art classifiers, the threshold based on the equal training sensitivity and specificity criterion (identified in the work of Jang et al. [30]) is 0.922. Concerning the classifier proposed by Elgendi [24], the SVM with the Gaussian kernel function provided the best performance in terms of sensitivity (0.8398) and specificity (0.5764) with an accuracy of 0.7153. Our classifier outperformed both state-of-the-art classifiers for the selected performance measures. Results obtained with state-of-the-art classifiers are shown in the lower panel of Table 4.

High-Quality Classifiers
For the Type 1 High-quality classifiers, a total of 10,574 pulses belong to the B&F class (7754 used in the training set and 2820 in the test set), while 1826 pulses belong to the E class (1046 used in the training set and 780 in the test set). Table 5 presents the performances of the HQ1 classifiers on the test set. The best method for balancing sensitivity and specificity is the SVM, using all the features (Sens = 0.9244, Spec = 0.9784) or the subset of selected SQIs (Sens = 0.9192, Spec = 0.9702). In both cases, the SVM has a Quadratic kernel. Final hyperparameters are reported in Supplementary Materials, Table S6.
For the Type 2 High-quality classifiers, 4612 pulses belong to the F class (3494 used in the training set and 1118 used in the test set), while the distribution of pulses belonging to the E class is the same used to train and test the HQ1 classifiers. Table 6 presents the performances of the HQ2 classifiers on the test set. The kNN method using the subset of features selected by the NCA provided the best results regarding sensitivity-specificity balance (Sens = 0.9321, Spec = 0.9195). The final hyperparameters are reported in Supplementary Materials, Table S7.
By comparing the best HQ1 and HQ2 classifiers, HQ1 achieved better performances in terms of accuracy and specificity (Acc = 0.9667, Spec = 0.9784) with respect to HQ2 (Acc = 0.9247, Spec = 0.9195), but slightly lower sensitivity (HQ1 Sens = 0.9244 vs. HQ2 Sens = 0.9321).
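The reported accuracies can be cross-checked against the sensitivity and specificity values, since accuracy is their average weighted by the class sizes in the test set. The sketch below uses the counts given in the text (a 3600-pulse test set containing 780 E pulses and 1118 F pulses); the function name is hypothetical.

```python
def accuracy_from_rates(sensitivity, specificity, n_positive, n_negative):
    """Accuracy as the class-size-weighted mean of sensitivity and specificity."""
    correct = sensitivity * n_positive + specificity * n_negative
    return correct / (n_positive + n_negative)


n_test, n_e_test, n_f_test = 3600, 780, 1118

# HQ1: E pulses against all remaining test pulses (B&F).
acc_hq1 = accuracy_from_rates(0.9244, 0.9784, n_e_test, n_test - n_e_test)

# HQ2: E pulses against F pulses only.
acc_hq2 = accuracy_from_rates(0.9321, 0.9195, n_e_test, n_f_test)
```

Both values agree with the reported accuracies (0.9667 for HQ1 and 0.9247 for HQ2) to four decimal places.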
Concerning the state-of-the-art classifiers, the threshold identified for the HQ1 classifier with Jang's method [30] was 0.991. The linear SVM obtained the best performance in reproducing the classifier proposed by Elgendi [24]. However, both state-of-the-art classifiers performed worse than our classifier: the accuracy was 0.7090 for Jang's and 0.8406 for Elgendi's. Notably, the former reached moderate sensitivity (0.6301) and specificity (0.7245), while the latter showed a sensitivity close to zero (0.0167).
The threshold for the HQ2 classifier with Jang's method [30] was 0.993. In reproducing Elgendi's classifier, the quadratic SVM obtained the best performance. Also in this case, both state-of-the-art classifiers performed worse than our best HQ2 classifier, similar to what we observed for the HQ1 classifier.

Discussion
In this work, we developed automatic classifiers to detect PPG pulses suitable for further processing based on their peculiar morphological characteristics. First, using accelerometer data, we estimated the activity level of the subjects. We then detected four activity ranges based on the quartile values of the aggregated A_ind values from all the recordings. From each recording, we randomly selected 100 pulses for each activity range. Of the 19 SQIs estimated from each labelled pulse, eight and nine SQIs were selected to train and test the algorithms to develop the Basic- and the two High-quality classifiers, respectively. The best algorithms were then chosen, and the classifiers' performances were compared against two state-of-the-art classifiers.
Categorizing pulses by activity level allowed us to train the algorithms with pulses containing distinct amounts of motion artifacts. In this way, the classifiers could learn to detect PPG pulses suitable for heart rate estimation or morphological analysis under various movement intensities. However, it appears evident from Figure 6 that only a tiny portion of pulses in the highest activity range reached F or E quality levels, even if the highest activity range in our dataset corresponded to light/moderate activity in the staging proposed by Lin et al. [41]. Several methods have been proposed to suppress the effect of motion artifacts on the PPG signals, either via software [50,51] or hardware [52,53] approaches. Our results suggest that future studies should combine algorithms for motion artifact suppression with a layer dedicated to signal quality analysis. This approach would be more conservative, allowing us to obtain reliable parameters from a larger proportion of recorded pulses, even during intense physical activity.
The three independent raters reached an almost perfect agreement in the labelling procedure, probably thanks to the strict definitions given for each quality level. The high level of the inter-rater agreement also supports the reliability of the resulting classifiers.
For each PPG pulse, we estimated 19 SQIs, calculated from two sources (i.e., PPG and ACC signals). Nine SQIs were novel and proposed for the first time in this study. The SQIs feature selection phase revealed that eight and nine SQIs were sufficient to solve the classification problem optimally for the BQ and both types of HQ classifiers, respectively. It is worth noting that most of the selected SQIs are novel features. In particular, two of the newly introduced statistical parameters (MedianPulse, StdPulse_noZ) and two parameters related to the PPG pulse morphology (Npeaks, ZDR) were selected for all classifiers here presented, adding important information that helped better solve the classification problem.
Although the extraction of multiple features inevitably increases the computational complexity compared with the extraction of a single feature, the cost of the features presented in this work remains low and grows linearly with N. Moreover, it is interesting to note that the NCA selected features with increasing computational complexity for the BQ (5*N FLOPs), HQ1 (19*N FLOPs), and HQ2 (25*N FLOPs) classifiers, in line with the increasing complexity of the classification problem.
It is also worth noting that the Peak2PeakACC feature from the accelerometer data was selected only for BQ and HQ1 classifiers, and not for the HQ2 classifier. This can be ascribed to the fact that B pulses (involved in both BQ and HQ1 classifiers) are generated because of motion artifacts, while the F and E pulses are largely independent of the movement.
All the implemented algorithms performed well as BQ and HQ1 classifiers. Except for the Neural Network fed with the full feature set, all the methods showed an accuracy higher than 0.90. However, the two classifiers differed in sensitivity and specificity: BQ classifiers showed a balanced sensitivity and specificity, while the HQ classifiers had specificity higher than sensitivity (on average, 0.9728 compared to 0.9729). This difference can be ascribed to the imbalance in the number of pulses in the two classes (only 1826 pulses belonging to the E class compared to 10,574 belonging to the B&F classes), meaning that the algorithms are better trained to detect pulses belonging to the B&F class than to the E class.
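The effect of this class imbalance can be illustrated with a short sketch. It uses the class sizes reported above and a deliberately trivial classifier that labels every pulse as B&F; taking E as the positive class is an assumption made for the illustration, as are the function and variable names.

```python
def sensitivity_specificity_accuracy(tp, fn, tn, fp):
    """Sensitivity, specificity and accuracy from confusion-matrix counts."""
    sens = tp / (tp + fn)
    spec = tn / (tn + fp)
    acc = (tp + tn) / (tp + fn + tn + fp)
    return sens, spec, acc


# Class sizes reported in the text (E taken as the positive class).
N_E = 1826
N_BF = 10574

# Trivial classifier: every pulse is labelled B&F, so no E pulse is ever found.
sens, spec, acc = sensitivity_specificity_accuracy(tp=0, fn=N_E, tn=N_BF, fp=0)
# sens == 0.0, spec == 1.0, acc is about 0.853
```

Even this uninformative rule reaches an accuracy of about 0.85 while missing every E pulse, which is why sensitivity and specificity need to be read alongside accuracy when the classes are this unbalanced.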
Regarding performance, some algorithms used to develop the HQ2 classifiers performed relatively poorly, with the exception of the Ensemble and Tree algorithms. Again, the imbalance between F and E pulses (4612 F pulses against 1826 E pulses) may have played a role. However, as also pointed out by Elgendi [24], it was reasonable to expect that a classifier aiming to detect E pulses against pulses belonging to a single quality level would achieve worse performance than a classifier trained to detect E pulses against pulses of different quality levels. In addition, it is necessary to consider the inevitable error propagation that a system of two cascaded classifiers entails. Some B pulses may be wrongly classified as F&E pulses by the first-stage BQ classifier, so performance might be even worse than reported in this study, since the HQ2 classifier was trained and tested only with true F and E pulses.
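The error propagation in a two-stage cascade can be sketched as below. The two threshold rules are toy stand-ins for the trained BQ and HQ2 classifiers, and the score fields `bq_score` and `hq_score` are hypothetical; the sketch only shows how a first-stage mistake reaches the second stage.

```python
def cascade(pulse, bq_accepts, hq2_accepts):
    """Label a pulse with a two-stage cascade.

    Stage 1 (BQ) separates B from F&E; stage 2 (HQ2) separates F from E and
    only sees the pulses that stage 1 let through.
    """
    if not bq_accepts(pulse):
        return "B"
    return "E" if hq2_accepts(pulse) else "F"


# Toy stand-ins for the trained classifiers: thresholds on hypothetical scores.
def bq(pulse):
    return pulse["bq_score"] >= 0.5


def hq2(pulse):
    return pulse["hq_score"] >= 0.5


pulses = [
    {"true": "B", "bq_score": 0.2, "hq_score": 0.1},  # correctly rejected
    {"true": "B", "bq_score": 0.6, "hq_score": 0.7},  # stage-1 error
    {"true": "F", "bq_score": 0.8, "hq_score": 0.3},
    {"true": "E", "bq_score": 0.9, "hq_score": 0.9},
]
predicted = [cascade(p, bq, hq2) for p in pulses]
# predicted == ["B", "E", "F", "E"]
```

The second pulse is a B pulse that slips through stage 1 and ends up labelled E. Stage 2 never encountered such pulses when it was evaluated on true F and E pulses alone, so its stand-alone figures do not account for this kind of error.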
Our best classifiers outperformed the two state-of-the-art classifiers. Notably, the thresholds identified for the Jang et al. [30] classifiers were higher than the values reported in the original work: 0.922 versus 0.673 for the BQ classifier, and 0.991 (0.993) versus 0.796 for the HQ1 (HQ2) classifier. These discrepancies could be due to the higher quality levels of the F and E pulses identified in this work. Nevertheless, the Jang et al. [30] BQ classifier attained good performance, with an accuracy of 0.9253, considering that a single SQI was used. On the other hand, the classifier proposed by Elgendi [24] demonstrated moderate performance for the BQ classifier (Sens = 0.8398, Spec = 0.5764) and poor performance for both HQ classifiers (Sens = 0.0167, Spec = 0.8406 for type 1; Sens = 0, Spec = 0.9991 for type 2).
The proposed classifiers can help extend the use of PPG signals recorded by wearable devices in the real world. On the one hand, the BQ classifier showed promising results, both in terms of sensitivity and specificity. Baek et al. [23] highlighted the detrimental effect of missing inter-beat intervals on HRV analysis. For this reason, a highly sensitive classifier is essential for detecting all pulses that can be used for HR estimation without losing discriminatory power by eliminating too many pulses because of their low quality. On the other hand, the SVM selected as the best HQ classifier has high specificity with (relatively) low sensitivity. However, compared to the other methods, it shows the best performance in terms of MCC, F1, and Cohen's κ. The importance of an HQ classifier is evident, given the number of significant applications that have been proposed in the last few years. Features extracted from PPG morphology could be used, for example, for stress detection [26,54,55] or blood pressure estimation [56–58], thus allowing for continuous monitoring with a simple wristband. A large part of the population at risk of developing, e.g., burnout syndromes or cardiovascular disease, would benefit from this achievement.
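For reference, the three summary metrics mentioned above can be computed from a binary confusion matrix as in the following sketch. The counts in the usage example are made-up values for illustration, not results from this study, and the function name is hypothetical.

```python
import math


def summary_metrics(tp, fn, tn, fp):
    """F1, Matthews correlation coefficient and Cohen's kappa (binary case)."""
    n = tp + fn + tn + fp
    # F1: harmonic mean of precision and sensitivity
    f1_den = 2 * tp + fp + fn
    f1 = 2 * tp / f1_den if f1_den else 0.0
    # MCC: correlation between predicted and true labels
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / mcc_den if mcc_den else 0.0
    # Cohen's kappa: observed agreement corrected for chance agreement
    p_o = (tp + tn) / n
    p_e = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n**2
    kappa = (p_o - p_e) / (1 - p_e) if p_e != 1 else float("nan")
    return f1, mcc, kappa


# Made-up counts for illustration only.
f1, mcc, kappa = summary_metrics(tp=90, fn=10, tn=95, fp=5)
```

Unlike accuracy, MCC and Cohen's κ account for agreement expected by chance, so they are less flattering to a classifier that mostly predicts the majority class.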
As a side result of this work, we built an annotated dataset that can be exploited in future studies. We are currently preparing the dataset for public release.
This study has some limitations, most of which are related to the sample population used to train and test the algorithms. First, more robust classifiers could be obtained by increasing the sample size: more subjects and labelled pulses would indeed be beneficial, preferably including subjects with arrhythmias or other cardiac pathologies. As currently designed, the classifiers we developed cannot discern arrhythmias from noise, thus potentially discarding arrhythmic beats that could also be useful for diagnostic purposes. Moreover, the algorithms' training phase could be refined by considering the subjects' age. As pointed out in [7], the dicrotic notch is more pronounced in healthy young adults than in older adults, and PPG morphology changes with age [25]. Therefore, a future study could collect and balance pulses belonging to different age groups in both the training and testing sets. In addition, the method proposed here could be further advanced by using recordings from different devices to train the signal quality algorithm. In fact, the results could be device dependent, thus limiting the generalizability to other devices.
The classifiers developed in this study have not been tested in real time. This is a crucial aspect to be assessed to understand whether the signal quality assessment can be smoothly embedded in the processing pipeline of wearable devices to provide reliable information with an acceptable delay [3]. Providing reliable health information in real-time would indeed facilitate the delivery of personalized treatments to the patient if and when needed [59].

Conclusions
This work aimed to develop two pulse-wise classifiers to detect reliable wrist PPG pulses that can be used in a real-world context for heart rate estimation and morphological analysis. We trained and tested several algorithms with a combination of features derived from different sources, including several novel features, and by selecting PPG pulses subjected to different levels of motion artifacts. The best performances were obtained by using subsets of features for both Basic- and High-quality classifiers. For both classifiers, the SVM with a Quadratic kernel achieved the best performance. Our results could help to improve the reliability and generalizability of the valuable biomarkers obtained from wrist PPG signals. Furthermore, the pulse-wise approach minimizes the loss of information by selecting all pulses suitable for either heart rate variability or morphological analysis. Future work can optimize the classifiers by increasing the sample size used to train the algorithms (in terms of both subjects and cardiac health conditions) and explore the feasibility of embedding these methods in wearable devices for real-time applications.
Supplementary Materials: The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/s22155831/s1, Table S1: The computational complexity for each feature. N = pulse length; Table S2: Results from neighborhood component analysis for the Basic-quality classifier applied ten times; Table S3: Results from neighborhood component analysis for the Type 1 High-quality classifier applied ten times; Table S4: Results from neighborhood component analysis for the Type 2 High-quality classifier applied ten times; Table S5: Hyperparameters for Basic-quality classifiers; Table S6: Hyperparameters for Type 1 High-quality classifiers; Table S7: Hyperparameters for Type 2 High-quality classifiers.
Institutional Review Board Statement: The study was conducted in accordance with the Declaration of Helsinki. A portion of the data come from a study approved by the Ethical Committee of Area Vasta Emilia Centro (Bologna, Italy; approval n° 542-2019-OSS-AUSLBO). For the rest of the data, no approval from the local ethical committee was needed.
Informed Consent Statement: Informed consent was obtained from all subjects involved in the study.

Data Availability Statement: The data presented in this study are available on request from the corresponding author.

Conflicts of Interest: The authors declare no conflict of interest.