An Evaluation of Non-Contact Photoplethysmography-Based Methods for Remote Respiratory Rate Estimation

The respiration rate (RR) is one of the physiological signals deserving monitoring for assessing human health and emotional states. However, traditional devices, such as the respiration belt to be worn around the chest, are not always a feasible solution (e.g., telemedicine, device discomfort). Recently, novel approaches have been proposed aiming at estimating RR in a less invasive yet reliable way, requiring the acquisition and processing of contact or remote Photoplethysmography (contact reference and remote-PPG, respectively). The aim of this paper is to address the lack of systematic evaluation of proposed methods on publicly available datasets, which currently impedes a fair comparison among them. In particular, we evaluate two prominent families of PPG processing methods estimating Respiratory Induced Variations (RIVs): the first encompasses methods based on the direct extraction of morphological features concerning the RR; and the second group includes methods modeling respiratory artifacts adopting, in the most promising cases, single-channel blind source separation. Extensive experiments have been carried out on the public BP4D+ dataset, showing that the morphological estimation of RIVs is more reliable than those produced by a single-channel blind source separation method (both in contact and remote testing phases), as well as in comparison with a representative state-of-the-art Deep Learning-based approach for remote respiratory information estimation.


Introduction
Physiological signs, such as Blood Volume Pulse (BVP), Blood Pressure, Electro-Dermal Activity, blood Oxygenation levels (SpO 2 ) and Respiration Waveforms and Rates are of chief importance in a variety of contexts related to health monitoring and affective computing. Among them, Respiratory Rate (RR) is crucial to detect and assess respiratory dysfunctions, such as apnea [1], as well as changes in breathing patterns that may be caused by stress [2].
Traditionally, reliable electrocardiography sensors or respiration belts have been adopted to measure RR, but in some cases, these approaches are not feasible due to several reasons, from physical discomfort, to the need of employing dedicated equipment and expertise, up to the actual impossibility of placing the required sensors [3]. Alternative solutions rely on the adoption of photoplethysmography (PPG) sensors, such as pulse oximeters, being less invasive and augmenting the portability [4]. PPG sensors capture the reflected light skin variations due to the blood volume changes. By design, their straightforward application concerns the measurement of cardiac activity; however, due to the tight bond between cardiac and respiratory activities, signals derived from PPG waveforms may be employed to extract respiratory-related information [5]. Interestingly enough, even general-purpose smartphone cameras can function as pulse oximeters [6], thus allowing the extraction of PPG signals without the use of dedicated equipment. As a • Respiratory-Induced Intensity Variation (RIIV) refers to the modulation of the BVP signal's amplitude caused by changes in venous return due to variations in intrathoracic pressure during the respiratory cycle. As a result, the PPG signal experiences baseline modulation. Inspiration causes a reduction in intra-thoracic pressure, leading to a slight decrease in central venous pressure and an increase in venous return, while expiration causes the opposite effect. • Respiratory-Induced Amplitude Variation (RIAV) is caused by a reduction in left ventricular stroke volume due to changes in intra-thoracic pressure, resulting in a decrease in cardiac output and peripheral pulse strength. During expiration, the opposite effect occurs. • Respiratory-Induced Frequency Variation (RIFV) refers to the periodic variation of Heart Rate (HR) throughout the respiratory cycle, with an increase during inspiration and a decrease during expiration. This change in pulse rate is due to an autonomic nervous system response known as respiratory sinus arrhythmia (RSA).
The chief concern of this work is to benchmark representative approaches that allow to extract RIVs from rPPG signals and to perform a sound statistical assessment of the results on a publicly available dataset. Notably, although other works have attempted to compare such methods (see, for instance, Ref. [14]); to the best of our knowledge, this is the first attempt at analysing these methods' performances on a publicly available dataset. The research questions that the present work aims at addressing are the following:

1.
Which kind of approach is more suitable for extracting respiratory related information from rPPG? 2.
What is the impact of using rPPG on the quality of estimation compared to contact-PPG? 3.
How does this result compare to representative modern approaches relying on Deep Learning methods? 4.
Are the eventual differences statistically significant?
Two main kinds of approaches have been proposed in the literature to achieve RR estimation via rPPG: (1) methods based on the direct extraction of morphological features attributable to breathing (RIVs) [11,12,[14][15][16][17], and (2) methods aimed at isolating the motion trend due to heart rate (HR) and RR [18][19][20], implicitly related to RIVs. By and large, the reliability of these approaches has been hitherto tested on small and/or private datasets, thus preventing a fair comparison among them.
Specifically, in order to evaluate methods based on the direct RIV extraction, we addressed the well-known Incremental Merge Segmenting (IMS), presented in Ref. [21], while putting in place several solutions to fuse the produced morphological features (e.g., average, median or PCA). As to the second group, the most promising approaches adopt the single-channel blind source separation to isolate the respiration from heart rate and noise. Here, we consider the Empirical Mode Decomposition (EMD) [22] and the singular spectrum analysis (SSA), and compare their estimates based on the different channels.
In our experimental analysis, the RR estimation is evaluated on both the contact reference and remote-PPG to assess the reliability of the different approaches mentioned above. In order to make the experiments reproducible and extendable, we use the publicly available BP4D+ dataset [23] that includes the RR ground truth, the contact reference, and RGB videos suitable to estimate remote-PPG signals. This last step is accomplished by exploiting the pyVHR framework [24].
In addition, an experiment involving a deep learning-based solution has been carried out. Specifically, the aforementioned signal processing-based approaches were compared with a representative state-of-the-art Deep Learning-based approach devised to estimate cardio-respiratory information from RGB videos.
The remainder of this paper is organized as follows: in Section 2 the Photoplethysmography is introduced; in Section 3, the considered RR estimation algorithms are outlined; in Section 4, the experimental analysis is reported, and in Section 5 the results so far achieved are discussed.

Photoplethysmography and Remote-Photoplethysmography
The PPG signal conveys information concerning blood volume changes in the microvascular bed of the tissue [25]. Its waveform typically displays: (1) A pulsatile component of artery blood (AC); (2) a high-frequency component (HF), composed of a systolic phase and a diastolic phase, which provides heart rate information; and (3) a non-pulsatile component of artery blood (DC), which is a low-frequency component (LF), related to blood oxygen saturation that slowly varies with respiration. It is possible to infer the respiratory rate from a PPG signal from this slow modulation in the LF component. Contact signals in classic PPG provide intensity change measurements of the light reflected from the finger skin when exposed to a Light-Emitting Diode (LED) source.
Similarly, remote-PPG aims at characterizing the blood volume changes, but taking into account reflected light skin variations as captured by an RGB camera and focusing on a peculiar Region of Interest (ROI), such as the cheeks or the forehead. Remote-PPG can thus be conceived as an approximation of the contact reference, while offering a notable advantage: it is contactless and remotely controllable.
Generally speaking, the time signal produced by classical PPG sensors can be closely approximated by the temporal traces of the RGB signal, which are generated by averaging the skin's light intensity at the pixel level within the ROI and concatenating them on a perframe basis. Several methods have been proposed to derive the remote-PPG from the RGB traces [26]. Here, we adopt the CHROM method [27], since contrarily to most of the other rPPG methods, it removes specular reflections at the skin surface. The implementation of CHROM used here is part of the pyVHR framework [24,26].
The subsequent procedures involve dividing a rPPG signal or a contact reference signal x(t) of length T into S segments or windows, each of M-sample length, with a shift of K samples between adjacent segments, thus producing an overlapping of M − K samples. The j-th segment, with j ∈ [0, ..., S − 1], can be expressed as where jK is the starting element of segment j.

Respiratory Rate Estimation
In this section, the operational processes and algorithmic mechanisms of the three mentioned methods, namely, Incremental Merge Segmenting (IMS), Empirical Mode Decomposition (EMD), and Singular Spectrum Analysis (SSA), are discussed in detail.

IMS Algorithm
IMS is an effective technique for analyzing PPG signals, extracting morphological features and detecting artifacts [6]. It operates in the time domain by segmenting the signal with sliding windows of fixed length and duration of a few tens of seconds (up to 30) (Equation (1)). From a bare technical point of view, it is a fusion of two algorithms, namely, the Iterative-End-Point-Fit (IEPF) [28] and the incremental algorithm [29]. Both of these algorithms were originally designed for computer vision, mobile robotics, and ECG signal compression applications. In a nutshell, IMS simply segments the (r)PPG signals in order to detect peaks, and subsequently obtain pulse amplitude, maximum and minimum intensity, and pulse period. These features are then employed to extract RIVs. More specifically, beats are detected at the end of a sequence of positive-gradient segments (systolic upslopes).
The IMS algorithm has been employed in Ref. [21] for the segmentation of remote-PPG signals and subsequent extraction of respiratory waveforms, as detailed below.
For each j-th segment, the input of the IMS algorithm are the segment values x j (t) and an integer parameter m < M (m is common to all segments). IMS uses a collection of strided points Y j = {y k = x j (mk) : k = 1, . . . , M/m } to compute morphological features. Specifically, it first computes the slope ρ k of each sub-segment Line[y k , y k+1 ], and then iteratively removes from the Y j set all central points of the triplets (y k , y k+1 , y k+2 ) that satisfy sign(ρ k · ρ k+1 ) > 0, that is, slopes with the same sign. The points stored in Y j are finally used to obtained minima Y min j and maxima Y max j through peak detection. The extraction of respiratory-induced variations (RIVs) can be easily accomplished by combining and filtering the values of Y min j and Y max j to effectively reduce any artifacts. In particular, given a generic sequence of values x, the artifact reduction is obtained computing the first derivative of its cubic spline interpolation: The three types of Respiratory Induced Variations can be mined from IMS-segmented PPG waveforms in the following way [30]: • The respiratory induced intensity variation (RIIV) is conveyed by the local maximumpeak-valued time series: • The respiratory induced amplitude variation (RIAV) is carried by the series generated from the difference between local maximum values and local minimum values (amplitude trend): where y max The respiratory induced frequency variation (RIFV) can be calculated by creating a tachogram composed of evenly sampled and minimum-peak-time-interspersed series, which consists of the time intervals between consecutive local minima: where y min i ∈ Y min j and arg(y min i ) are the time instants associated to the i-th value in Y min j . An example of extracted RIVs, from both contact and remote PPG, is shown in Figure 1, while Figure 2 shows at a glance the three RIVs on a sample cardiac signal from the BP4D+ dataset.
The RIFV j , RI IV j , and RI AV j extracted from (r)PPG signals can be used individually to generate remote RR estimations or combined to obtain more reliable estimates [14,15,31]. To this end, besides the average and the median over the three RIVs, PCA is computed considering the RIVs as an ensemble of realizations of the respiratory trend. RIVs are therefore projected onto their principal component, and only the first principal component is taken into account for the RR estimation.
This process produces six estimates: RR RIFV , RR RI IV , RR RI AV , RR avg , RR median , and RR PCA .

EMD: Empirical Mode Decomposition
Empirical mode decomposition (EMD) is a powerful analytical tool used to effectively describe non-linear and non-stationary time series. This approach involves projecting the time series onto a space basis composed of intrinsic mode functions (IMFs) [22]. Unlike Fourier Transforms and wavelet decomposition, EMD works entirely within the time domain to decompose the data into its constituent IMFs, and it does not require any prior assumptions about the signal's frequency or time-frequency characteristics. Instead, it adaptively decomposes the signal based on its local frequency content using a a timedomain algorithm called "sifting", capturing the underlying dynamics of the system. In a nutshell, EMD isolates a small number of temporally adaptive basis functions (the IMFs) and directly derives the frequency and amplitude dynamics from them [32]. The IMFs can be thought of as locally changing counterparts to common frequencies. This characteristics of IMFs make them a valuable tool for analyzing non-stationary time series with rapidly varying frequencies, allowing for the accurate identification and extraction of frequency and amplitude dynamics. More precisely, an IMF is defined as a function that satisfies well-defined conditions related to its Hilbert transform, and summarized in the following two properties: 1.
The number of extrema (i.e., the maximum and minimum amplitudes of the signal) and the number of zero-crossings must be equal, or differ by one at most. This property ensures that the function is oscillatory in nature, with a well-defined and localized frequency structure that can be accurately extracted using techniques such as the Hilbert transform.

2.
The function must be symmetric with respect to a local zero mean. The need for a local time scale to calculate the mean makes it challenging to define a function as symmetric for non-stationary processes. To solve this problem, the local mean envelope concept is introduced, which is determined by the function's local maximum and minimum values. By enforcing local symmetry around this envelope, the IMF can be accurately characterized and used to extract information about the underlying dynamics of the system.
In formal terms, given an arbitrary non-stationary PPG signal x(t), EMD adaptively decomposes its segments where r j (t) represents the residual non-stationary trend. The iterative sifting process to derive a generic k-th IMF function is summarized in the following iterative steps. As initialization, we set the data to be processed as v(t) = x j (t), k = 1, i = 0.

1.
Extract the local maxima and minima of v(t).

2.
Form the upper and lower envelope e u (t) and e l (t) by cubic spline interpolation of the extrema, and compute the mean m(t) = (e l (t) + e u (t))/2.

4.
When i > 1, evaluate whether d i (t) is a zero-mean function. This is obtained in terms of the standard deviation of two subsequent iteration results: If the standard deviation exceeds a fixed threshold (set to 0.1, according to [32]), set v(t) = d i (t) as the new data, increment i, and repeat steps 1-4 until ending up with the k-th IMF, that is, h k j (t) = d i (t). Once the k-th IMF is obtained, the remaining IMFs were computed by applying the sifting process to the residual signal defined as r(t) = v(t) − h k j (t), repeating the steps 1-4 to compute the next IMF, until the n-th residual is a monotonic function or a function with less than two local maxima or minima.
An example of h k j (t) computation is shown in Figure 3. From an implementation perspective, we adopted the solution provided by Ref. [32], limiting the number of IMFs extracted to four components, as suggested in Ref. [21]. The output of the EMD algorithm is shown in Figure 4.

RES
[mmHg] (a)   For the RR extraction, only the first three IMFs, among the IMFs extracted through the EMD algorithm, were taken into account.
The RR I MFi estimate is obtained for each IMF mode by selecting the frequency in the respiratory spectral domain with the highest power. The highest peaked frequency among the estimates RR I MF1 , RR I MF2 , RR I MF3 , RR I MF4 is taken as the RR EMD estimate.

SSA: Singular Spectrum Analysis
Singular spectrum analysis [33] is a decorrelation technique that projects a single mixture of zero-mean sources (time series) onto an orthonormal space basis. It involves decomposing the time series (PPG in our case) into a set of empirical orthogonal functions (EOFs) by constructing a trajectory matrix from the original data and applying the Singular Value Decomposition (SVD) to the matrix. The result is a set of principal components (PCs) that capture the most dominant patterns in the data. These PCs are then used to reconstruct the original time series in a way that separates the signal from the noise.
To perform the embedding, we mapped a PPG segment x j (t) of length M into a sequence X i j of K = M − L + 1 lagged vectors of size L, with 1 < L ≤ M: The L-trajectory matrix X associated to x j (t) is and the lagged vectors X i j represent the columns of X. Each row and column of X consists of subseries of the original segment (or series) x j (t). It holds that the trajectory matrix X is a Hankel matrix, where the elements along any diagonal where i + j is constant have the same value.
The basic steps of SSA are as follows.

1.
Embedding. The realization x j (t) is embedded into a trajectory matrix X using the sliding window approach described above. 2.
SVD decomposition. To obtain the principal components, apply SVD to the trajectory matrix X, as X = U Σ V T , where U is the matrix of the eigenvectors (left singular vectors), Σ is the diagonal matrix of the singular values (λ 1 , . . . , λ L ) and V is the matrix of the right singular vectors of X. In this notation, the trajectory matrix X can be written as where X i = √ λ i U i V T i is the elementary matrix, and d = min{K, L} is the rank of X (matrices X i have rank 1). The sequence of elements of the i-th eigenvector U i is defined as the i-th Empirical Orthogonal Function (EOF) (see Figure 5).

3.
Grouping. The principal components are grouped into sets, aiming at representing the different patterns present in the data (e.g., noise, periodicity, trend). This corresponds to partitions in the set of indices {1, . . . , d} into p disjoint subsets S 1 , . . . , S p . 4.
Reconstruction. Given a subset of indexes S i = {i 1 , ..., i l }, an approximation of X is reconstructed as a sum of the corresponding elementary matrix only: Then, applying the Hankelization ofX i , the time series For the sake of brevity, Hankelization and averaging are not reported here; see Ref. [34] for details. The SSA output, for the remote and contact analysis, is reported in the Figure 6.
Here, no PC grouping procedure was employed (i.e., p = d) taking into account the EOFs individually. In particular, only the first three EOFs were considered for the RR extraction: RR EOFi is estimated as the highest peak in the respiratory power spectral density range of z i . Then, the highest estimate among the RR EOF1 , RR EOF2 , RR EOF3 was chosen as the RR SSA value. (c)

Dataset
The study was carried out employing the BP4D+ dataset [23], since it is one of the few available datasets that provides face video frames, contact references, and a real respiratory ground truth for each subject. The BP4D+ dataset, an extension of the BP4D database, is a Multimodal Spontaneous Emotion Corpus (MMSE), which collects 3D, 2D, thermal, and physiological data sequences (e.g., heart rate, blood pressure, skin conductance (EDA), and respiration rate), and meta-data (facial features and FACS codes). The dataset collects data of 140 subjects, 58 males and 82 females, with ages ranging from 18 to 66 years old.
Subjects of various ethnicities participated in the data acquisition process (including East-Asian and Middle-East-Asian, Hispanic/Latino and Native American). For each subject, 10 tasks (eliciting different emotional states) were included in the database. Frames were acquired at 25 fps, and physiological signals were sampled at 1000 Hz. For the aim of this study, only the thoracic-impedance respiratory reference, the contact reference, and the subject's face video frames were taken into account, considering all the subjects and all the tasks.
As stated in Ref. [31], some signals in the BP4D+ respiratory ground truth are strongly affected by artifacts and disturbance, therefore the thoracic-impedance respiratory reference signals were pre-processed to check their reliability as a respiratory target. For each task and subject, the respiratory reference was filtered with a second-order Butterworth bandpass filter, with cutoff frequencies of 0.1 Hz and 0.5 Hz, and normalized between the [−1,1] amplitude range. Then peak and troughs detection was performed, so that the standard deviation σ 1 of the peak-to-peak time intervals and the standard deviation σ 2 of the heights of the troughs (minima) could be computed. A signal was considered "corrupted" if σ 1 > 1 or σ 2 > 0.2 (leading to the rejection of the related data). Additionally, each signal with a duration shorter than 30 s was discarded. This results in 209 accepted signals.

Analysis
The single-channel blind source separation methods (EMD and SSA) and the RIV extraction technique via (r)PPG signal segmentation (IMS) are adopted during the RR estimation phase, processing the remote PPGs and contact references of each BP4D+ video. The general procedure for computing RR is depicted at a glance in Figure 7, and can be briefly recapped as follows: 1.
Pre-processing: The rPPG signal associated with the i-th video is filtered using a fourth-order Butterworth band-pass filter with cut-off frequencies of [0.18 Hz, 1.0 Hz] for each temporal window indexed by j.

2.
Methods: The filtered signal is processed by one of the analysed approaches (IMS, EMD, SSA) in order to extract respiratory related information.

3.
Post-processing: Each approach yields a number of estimates that are subsequently post-processed with an artifact reduction technique. It basically consists of a cubicspline interpolation of the original estimate followed by the computation of the fist derivative of the obtained signal (cfr. Equation (2)).

4.
RR estimation: RR is obtained by choosing the most prominent peak in the Power Spectral Density (PSD) estimated with the periodogram of the post-processed signal. The frequency range within which RR is picked is adaptively set, based on the heart rate extracted from the same signal.
The obtained RR-estimates are then compared to the corresponding ground truth references via commonly used error metrics. Contact references are processed in the same way, except for the post-processing phase, which involves only the computation of the fist derivative of the signal.

Error Metrics
For each estimator, the obtained RR estimate is compared to the RR extracted from the thoracic-impedance respiratory reference using the following metrics: • Mean Absolute Error (MAE) The Mean Absolute Error measures the average absolute difference between the estimatedĥ and reference RR h. It is computed as: Smaller MAE values suggest better predictions. The MAE is a fairly interpretable measure, as it provides the average distance in terms of breaths per minute of the predictions with regard to the ground truth. • Root Mean Squared Error (RMSE). The Root-Mean-Squared Error measures the difference between quantities in terms of the square root of the average of squared differences, that is, RMSE represents the sample standard deviation of the absolute difference between the reference and measurement, that is, a smaller RMSE suggests more accurate extraction. In contrast to the MAE, few large differences increase the RMSE to a greater degree due to the squaring of the differences.
Results for all the aforementioned estimators applied to both contact and remotely estimated signals are reported in Table 1.

Bland-Altman Analysis
In order to assess the level of agreement between the employed rPPG-based RR estimation methods and the reference, Bland-Altman analysis [35] has been employed. It allows to quantify the difference between measurements using a graphical method. A scatter-plot (Bland-Altman plot) is produced in which the X-axis represents the average of the two measurements, and the Y-axis represents their difference.
For every pair (RR estimator , RR target ), the mean value (i.e., what is likely to be interpreted as a RR trade-off between expectation and the true value) versus the corresponding error (i.e., how reliable the compromise is for the RR measurement) are considered: The sign of this quantity allows to unveil the presence of eventual systematic biases in the estimation. Specifically, negative errors indicate that the RR is, on average, underestimated, while positive errors suggest that the RR is typically overestimated. Figure 8 reports Bland-Altman plots for the RR median and RR EOF3 estimators acting on contact and remote-PPG signals.
As can be noted, the RR median estimator (Figure 8a,c) is unbiased (Error RR ∼ 0) and does not exhibit any linear dependency between the average of the two measurements and their difference for both the contact and remote signals. On the contrary, an inspection of the RR EOF3 Bland-Altman plot (Figure 8b,d)

Significance Testing
Finally, statistical analyses have been carried out to evaluate the significance of differences in the performance of the 14 estimators (populations) on 209 samples (paired videos). Following Ref. [36], the populations were checked with the Shapiro-Wilk test for normality, and with Levene's test for homogeneity. Due to the rejection of the null hypothesis of normality for at least one population, the non-parametric Friedman's test and the associated Nemenyi test were employed as the omnibus test and for post hoc analysis, respectively.
Friedman's test rejected the null hypothesis of equality of the medians of the population distributions (p < 0.05); hence, a statistically significant difference exists between the analysed estimators. The Nemenyi post hoc test was thus employed for the assessment of the differences between each population. The output of the post hoc Nemenyi test can be visualized through the Critical Difference (CD) diagram [36]; CD diagrams show the average rank of each estimator (higher ranks meaning lower average errors); models whose difference in ranks do not exceed the CD α (α = 0.05) are joined by thick lines and cannot be considered significantly different. Results for the MAE and RMSE metrics obtained from the remote-PPG signal are depicted at a glance in Figures 9 and 10.
Tables 2 and 3 report the results of the above procedure for the RMSE and MAE metrics, respectively. Estimators are ranked with regard to the remote-RR estimation performances according to the metric at hand. Moreover, the magnitude of the difference between the estimators is reported in terms of Akinshin's gamma effect size.

Comparison with a Deep Learning-Based Approach
The approaches evaluated in this work so far pledge to extract respiratory-related information by exploiting the well-known intertwining between the cardiac and respiratory system; on such basis, the signal processing-based methods surveyed here allow to extract the physiological signals of interest with varying levels of accuracy. It is interesting to confront such approaches relying on domain knowledge with modern alternatives that allow to automatically learn these relationships from data; this is usually accomplished by employing end-to-end deep neural network (DNN) models. DNNs have gained significant attention in many disciplines, including computer vision, in virtue of their superior performance exhibited for a variety of tasks when compared to traditional approaches that require manual feature design.
Notably, techniques for recovering physiological signals via DNN-based methods have also emerged. The latter have become increasingly popular in the related literature, as evidenced by recent reviews, for example, Ref. [37,38]. Despite the reported remarkable performances, few DNN solutions have been made publicly available in terms of both code and learned model weights. This lack of availability raises concerns about the reproducibility of results and the ability to properly assess these methods.
Deep remote RR estimation makes no exception; in the last few years, a variety of approaches have been proposed for predicting respiratory rates or waveforms directly from RGB videos [39][40][41][42]. Unfortunately, to the best of our knowledge, only a handful of them have been made available to the scientific community. One of them is represented by MTTS-CAN, a neural architecture proposed in Ref. [40], which employs a tensor-shift module and 2D-convolutional operations to efficiently perform spatial temporal modeling, thus enabling real-time cardiovascular and respiratory measurements. MTTS-CAN is end-to-end trained to predict both Blood Volume Pulse (BVP) signals and Respiratory Waveforms (RWs) from videos displaying human faces. Model optimization is shaped as a multi-task learning problem. At test time, BVPs as RWs can be easily obtained by feeding the network with a given video sequence; no pre-processing steps are required, except for performing trivial image normalization procedures.
It is worth noting that MTTS-CAN is designed to work in an end-to-end manner and does not estimate respiratory information from rPPG signals, as intended. Yet, the multitask temporal shift module employed to extract both cardiac and respiratory information is eventually capable of leveraging both sources in order to deliver more robust estimates. Consequently, it lends itself well to be compared with the approaches presented here.
Given a RW estimated by MTTS-CAN, RR can be eventually obtained via spectral analysis (cfr. point 4 in Section 4.2). The Bland-Altman plot depicting the results obtained by MTTS-CAN on the BP4D+ dataset is displayed in Figure 11. As can be observed, MTTS-CAN tends, on average, to slightly overestimate the respiration rates on BP4D+; on the contrary, no marked linear dependencies between the average of the two measurements and their difference are noticeable. The standard deviation of the errors is comparable with the best-performing signal processing-based approaches.
In addition, to compare knowledge-based signal processing techniques with DNNbased approaches, we follow the same experimental and significance testing procedure, as described earlier in Section 4.2.3. Specifically, we compare the RR estimates produced by MTTS-CAN and benchmark its performance against the three best estimators from each method examined in this study (RR avg , RR EOF2 , and RR I MF2 ). We present the results in Table 4 and include the corresponding CD diagrams for both the MAE and RMSE metrics in Figure 12.  Surprisingly, despite a fairly acceptable ranking with regard to the signal processingbased approaches, the RR estimates delivered by MTTS-CAN appear to be significantly less accurate than those obtained by the RR avg estimator; the latter proves to be the most reliable approach among those benchmarked in the present work. MTTS-CAN results are not significantly different from those yielded by the RR EOF2 estimator, while the EMDbased estimator delivers the worst result.

Discussion and Conclusions
This paper has conducted a statistical evaluation of prominent methods for remotely estimating respiratory rates through rPPG signals. Specifically, we compared the performances of the IMS algorithm (a well-known method extracting morphological PPG features related to respiratory information) and two single-channel BSS approaches. Experiments have been carried out on a publicly available dataset, promoting greater comparability and reproducibility of the findings.
The results can be best summarized by inspecting Figure 9; the RR avg and RR median estimators are the best-performing. Notably, the difference with the other estimators appears to be statistically significant. No significant differences were found between RR avg and RR median ; yet, these significantly outperform RR PCA (albeit with a small magnitude effect size). Overall, according to the RMSE metric, the IMS-derived estimators delivered the most accurate results.
The remote SSA-derived RR EOF2 estimator performs fairly well in comparison to the IMS-derived ones; this is due to the fact that the second EOF extracted by SSA is presumably devoted to the extraction of the RIIV (cfr. Figure 6). For the same reason, RR I MF2 performed quite well for both contact and remote methods, while on average, the other EMD-related estimators performed worse with regard to the other estimation methods. Interestingly enough, the third EOF and the first IMF both exhibited bad results, as probably often associated with the cardiac information rather that the respiratory one (crf. Figures 1 and 6). RR I MF3 strongly underestimates the RR, meaning that it is heavily affected by low-frequency artifacts. Similar conclusions can be drawn by inspecting the MAE-related CD diagram ( Figure 10). Furthermore, we have conducted a comparison between the three top-performing signal processing-based methods for extracting respiratory information from rPPG waveforms and a newly proposed, state-of-the-art deep learning-based method apt at jointly estimating cardio-respiratory data from RGB videos. The findings indicate that the method based on extracting and merging Respiratory-Induced Variations (RIVs) from morphological features achieves the highest accuracy in comparison to the other signal processing or Deep Learning-based methods adopted here, as measured by both RMSE and MAE metrics.
Based on the assessment conducted in this paper, it can be concluded that the estimation of RR using morphological features of the PPG signals is the most dependable method. Moreover, it can be observed that the three RIVs basically convey the same information, which is mostly related to respiration, while being differently dependent on the subject's individual RR, gender or age [31], as well on the subject's health. In order to attenuate this dependency, and to make the estimation less susceptible to interference (largely when the subject's motion is detected in the measurement), the application of fusion methods (mean, median or PCA) should be preferred.
For what concerns the single channel blind source separation techniques (SSA, EMD) the main problem is that, depending on the spectral characteristics of the artifacts in the recorded phenomena, nothing guarantees that the respiratory oscillation will be always displayed by the same IMF or the same EOF. Results show that the second IMF or the second EOF modes are the most suitable sources of respiration-related oscillations; yet, this does not provide a guarantee-results might depend on the number of low-frequency artifact sources that affect the signal and on the specific frequency subrange in which the true RR falls (i.e., the current physiological conditions of the monitored subject). This problem has been identified in the EMD literature with the terms "mode mixing", when a single source contains different oscillatory modes which actually are separated sources, and "mode splitting", when an actual single source is displayed in different extracted sources [43].
Although some improved techniques (such as the complementary ensemble EMD, the complete EEMD, the partly EEMD, the noise-assisted multivariate EMD, NA-MEMD, and the fast multivariate EMD, FMEMD) have been proposed to deal with mode mixing and mode splitting, the problem is still open [44]. In particular, for what concerns respiration, mode splitting is mainly caused by the fact that respiration is a spontaneously modulated phenomena that spreads differently in the low-frequency spectrum range, according to physical, cognitive and emotional demands; thus, it is often "naturally" split in timesubsequent different modes. Tracking those splits could be trickier than tracing respiration in another way.
Lastly, upon comparing the results obtained from the use of contact and remote PPGs, we noticed only a minor variation. To quantify this remark, it is worth noting that the estimates achieved by the most successful method, IMS (first six rows of Table 1) adopting either contact or remote PPGs, have an average difference of 0.47 breaths/min and 0.53 breaths/min in RMSE and MAE, respectively. Notably, the adoption of remote PPG only involves a negligible worsening of performance.
To summarize, the experiments reported here by and large show that, on the adopted dataset, the approach based on the extraction of respiration-related morphological features from (r)PPG signals should be preferred over single-channel blind source separation techniques (either SSA or EMD). More specifically, the estimation and fusion of the three Respiratory Induced Variations (RIVs) from (r)PPG signals (RIIV, RIFV, RIAV) allow for the achievement of the best results in terms of both RMSE and MAE. Empirical Mode Decomposition (EMD) performed worse on our benchmark, despite being one of the most widely employed single-channel source separation techniques for the extraction of respiratory information from PPG signals. Conversely, Singular Spectrum Analysis allowed us to separate respiratory information with higher levels of accuracy. Interestingly enough, SSA estimates performed similarly to a state-of-the-art pre-trained Deep Learning-based method. Notably, the difference in performances between the latter and the best-performing signal processing-based approach resulted to be significant with a medium effect size. Ultimately, the rPPG has been shown to result in only a slight decrease in performance when compared to contact PPG, thus enabling its dependable and extensive applicability.