Multi-ROI Spectral Approach for the Continuous Remote Cardio-Respiratory Monitoring from Mobile Device Built-In Cameras

Heart rate (HR) and respiratory rate (fR) can be estimated by processing videos framing the upper body and face regions without any physical contact with the subject. This paper proposes a technique for continuously monitoring HR and fR via a multi-ROI approach based on the spectral analysis of RGB video frames recorded with a mobile device (i.e., a smartphone’s camera). The respiratory signal was estimated from the motion of the chest, whereas the cardiac signal was retrieved from the pulsatile activity at the level of the right and left cheeks and the forehead. Videos were recorded from 18 healthy volunteers in four sessions with different user-camera distances (i.e., 0.5 m and 1.0 m) and illumination conditions (i.e., natural and artificial light). For HR estimation, three approaches were investigated, based on either a single ROI or multiple ROIs. A commercially available multiparametric device was used to record reference respiratory signals and electrocardiogram (ECG). The results demonstrated that the multi-ROI approach outperforms the single-ROI approach, providing temporal trends of both vital parameters comparable to those provided by the reference, with a mean absolute error (MAE) consistently below 1 breaths·min−1 for fR in all the scenarios and an MAE between 0.7 bpm and 6 bpm for HR estimation, with higher values at the larger distance.


Introduction
The monitoring of vital signs such as the respiratory rate (fR), heart rate (HR), body temperature, and blood pressure is essential to assess general health status [1]. Among these, HR is an essential indicator. Heart function is affected by many factors such as psychosocial stress, smoking, excessive use of alcohol, malnutrition, lack of physical activity, and congenital diseases [2]. fR also plays a vital role in assessing the health status of a subject, providing information on physical deterioration and supporting the prediction of cardiac arrest. Moreover, it is related to various stressors (e.g., emotional stress, cognitive load) [3]. Therefore, information about HR and fR can be used in a wide range of applications such as medical diagnostics, fitness assessment, and mental stress analysis. Most traditional methods used to measure HR and fR require physical contact with the subject (e.g., electrocardiography via skin-attached electrodes for HR measurement, flow sensors for fR measurement), which represents the main limitation in their usage.
The use of non-contact methods may extend the capability of monitoring the health status of users directly in the home environment, thus making possible the early diagnosis of diseases or the continuous tracking of chronic ones in a less constrained setting outside hospital clinics [4]. Furthermore, with the outbreak of the COVID-19 pandemic, the need to monitor patients remotely to avoid overloading hospitals, as well as to develop telemedicine services, has become increasingly pressing. Among the wide range of existing non-contact processing of the video to obtain the pulsatile and the breathing-related signals, and (ii) post-processing the raw signals for HR and fR estimation.
The working principle behind the extraction of the pulsatile and breathing-related signals from a video is explained in the following section.

Remote Photoplethysmography: Principle of Work
Remote photoplethysmography (r-PPG) is a technique for remote measurement of human cardiac activities by detecting subtle color variations visible on the subject's skin utilizing a multi-wavelength RGB camera [29]. r-PPG shares the same principle as PPG, which involves the measurement of light transmitted through or reflected from the skin, capturing the changes in blood volume [30]. The optical properties of human skin are determined by the presence of different chromophores in the layers of the skin. However, skin color is mainly given by the presence of melanin in the epidermis and the hematic pigments (i.e., hemoglobin, oxyhemoglobin, beta carotene, and bilirubin) present in the dermis/hypodermis vascular plexus. Among the hematic pigments, the primary source of rhythmic variation in light absorbance and reflectance is hemoglobin present in blood vessels [31]. When the light falls on the skin surface, the variation of the intensity of reflected light due to the pulse wave activity can be captured by a camera.
The observed intensity depends on the camera and light source distance to the measurement point in the subject's skin. Subtle changes can be observed over time in the color depending on the blood circulation, movement, and specular variation in the body.

Hardware for Video Data Recording
The measuring system used in this work consists of a built-in digital smartphone camera (iPhone 6s, Apple Inc., Cupertino, CA, USA) with a resolution of 1280 × 720 pixels and a frame rate of 30 fps. The camera records 8 bits per channel. In RGB space, each color is represented by the superposition of three values (i.e., one for each channel), which encode the intensities of red (R), green (G), and blue (B) contributing to that color. Since the smartphone camera used in this work has automatic white balance (AWB) functions that vary the camera's exposure parameters during recording, the videos were recorded using the ProMovie Recorder App (by Panda Apps Limited) to avoid AWB. The app allows the exposure parameters to be locked at the beginning of the acquisition and kept fixed for the whole duration of the video. The shutter and ISO parameters were set up at the start of the acquisition.

Algorithm for the Video Preprocessing
The preprocessing of the recorded video was performed offline via a custom-made algorithm developed in the MATLAB environment.

Identification of Regions of Interest (ROIs) on Face and Torso
Firstly, a video including the face and torso regions is recorded using the built-in camera. The pit of the neck and the face are identified in the image automatically via the Viola-Jones algorithm [38]. The ROI at the level of the torso is created from the pit of the neck to retrieve the chest wall movements associated with the respiratory activity [39]. Then, three ROIs are detected on the face in the first video frame: the right cheek (hereinafter RCheek), the left cheek (LCheek), and the forehead (FHead), since these are facial regions with good vascularization [40]. Face ROIs were square, with a side equal to 5% of the frame height for videos recorded at 0.5 m and 3% for videos recorded at a distance of 1 m.

Respiratory Signal from the ROI Torso
The respiratory signal is extracted from the pit-of-the-neck ROI using a method that relies on the computation of optical flow (hereinafter OF), which allows the estimation of the displacement between two consecutive images by tracking the features of the images [41,42]. Figure 1 reports the proposed framework for the extraction of the breathing waveform and the continuous estimation of fR. A prior transformation from images in RGB space to greyscale images is required to apply the OF to video frames. In this paper, the Horn and Schunck (HS) algorithm is used to extract OF from the frames [43]. The algorithm assumes that pixels maintain their intensity along their trajectory. Assuming a constant brightness, the intensity of a pixel I(x, y, f) at the frame f will remain stable over short times and small movements.
For a single frame step df, the following equation is valid:
$$I(x, y, f) = I(x + dx, y + dy, f + df) \qquad (1)$$
where dx and dy denote the displacements in the x and y directions. Assuming that the pixel displacement is sufficiently small, Equation (2) is obtained:
$$\frac{\partial I}{\partial x} v_x + \frac{\partial I}{\partial y} v_y + \frac{\partial I}{\partial f} = 0 \qquad (2)$$
where vx and vy are the components of the pixel velocity along the x-axis and the y-axis of the OF of I(x, y, f) that must be determined. The HS algorithm computes the displacement between two consecutive images by tracking the image features on a pixel-by-pixel basis. In this way, a velocity vector for each pixel in the image is obtained. According to [43], the velocity component along the y-axis (i.e., v(y,f)) was assumed as the most related to movements of the rib cage caused by breathing. The HS optical flow was applied to all video frames, and the values of the y-velocity vectors within the ROI were averaged to obtain a single value per frame. The average velocity vector (vy) was integrated to obtain the linear displacement of the rib cage related to the respiratory activity (sy).
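The last two steps of this chain (averaging the y-velocity inside the ROI, then integrating it over frames) can be illustrated with a minimal Python sketch. It assumes the per-pixel y-velocities have already been produced by an optical-flow routine; the function names `roi_mean_velocity` and `integrate_velocity` and the rectangular cumulative sum are illustrative choices, not the authors' MATLAB implementation.

```python
def roi_mean_velocity(vy_roi):
    """Average the y-velocity values of one frame inside the ROI.

    vy_roi is a 2-D list (rows of pixels) of y-velocity values, assumed to
    come from an optical-flow routine such as Horn-Schunck.
    """
    flat = [v for row in vy_roi for v in row]
    return sum(flat) / len(flat)


def integrate_velocity(mean_vy, dt=1.0):
    """Cumulatively integrate the per-frame mean velocity into a displacement.

    dt is the time step between frames, in whatever unit the velocity uses
    (assumption: dt = 1 frame when the flow is expressed in pixels per frame).
    """
    displacement = []
    total = 0.0
    for v in mean_vy:
        total += v * dt          # rectangular (cumulative-sum) integration
        displacement.append(total)
    return displacement
```

Applying `roi_mean_velocity` to every frame and passing the resulting list to `integrate_velocity` yields one displacement sample per frame, which plays the role of sy in the text.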

r-PPG Signal from the ROIs Face
Three time-dependent RGB signals are obtained by spatially averaging the intensity of the pixels within the ROI for each frame. In this way, the raw signals for each ROI are obtained: xR(i), xG(i), and xB(i), with i = 1, ..., N, where N is the number of frames. We then proceed with the conditioning of the raw signals, which involves two steps: (i) raw r-PPG signals detrending [18,44,45] and (ii) smoothness prior approach (SPA) [10,15,46–49]. At this point, for each ROI, we obtain a three-dimensional signal in time, where the three RGB channels represent the three dimensions. To proceed with a spectral analysis of the signals for HR extraction, it is necessary to reduce the dimensionality of the signal to obtain a one-dimensional signal as a function of time. The choice of which method to use is one of the most debated issues in the r-PPG literature. In this paper, six post-processing techniques have been implemented and compared. They can be divided into color-based methods (i.e., Green Channel Analysis, modwtmra, Chrominance-based signal processing method, Plane Orthogonal to Skin) and Blind Source Separation Analysis (i.e., Independent Component Analysis, Principal Component Analysis). The framework of the proposed r-PPG-based HR extraction method is shown in Figure 2.
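The spatial averaging that produces the raw RGB traces can be sketched as follows. This is a minimal illustration, assuming each frame is available as a nested list of (R, G, B) tuples and that the square ROI lies fully inside the frame; the helper names `roi_side_pixels` and `roi_mean_rgb` are hypothetical.

```python
def roi_side_pixels(frame_height, fraction):
    """Side of a square ROI expressed as a fraction of the frame height."""
    return max(1, round(frame_height * fraction))


def roi_mean_rgb(frame, top, left, size):
    """Spatially average the R, G, and B intensities inside a square ROI.

    frame: list of rows, each row a list of (R, G, B) tuples.
    top, left: coordinates of the upper-left ROI corner (in pixels).
    size: ROI side in pixels (assumed to fit inside the frame).
    """
    sums = [0.0, 0.0, 0.0]
    for row in frame[top:top + size]:
        for pixel in row[left:left + size]:
            for c in range(3):
                sums[c] += pixel[c]
    n = size * size
    return tuple(s / n for s in sums)
```

Calling `roi_mean_rgb` once per frame and per ROI gives the three raw time series that are then detrended and reduced to a single r-PPG trace.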

Green Channel Analysis
The green channel analysis is the most used method to derive plethysmographic signals from the analysis of RGB camera recorded videos [14,18,50–52]. As demonstrated in the study proposed by Verkruysse and colleagues [53] in 2008, the green channel offers the best signal quality in terms of signal-to-noise ratio (SNR) and accuracy in HR estimation, whereas red and blue are close to the noise value. The signal obtained with this technique is hereinafter reported as r-PPGGreen.

Modwtmra
The modwtmra is an algorithm that can be used to enhance the quality of the r-PPG signal obtained from the green channel [54]. The raw signal was filtered using a fourth-order symlet wavelet filter with an assumed decomposition level of 4, allowing the extraction of the signals originating from the heart [54]. The signal obtained with this technique is hereinafter reported as r-PPGMODW.

Chrominance-Based Signal Processing Method (CHROM)
The chrominance-based signal processing method (CHROM) was first introduced by De Haan and Jeanne [35] to enhance the robustness in the estimation of the r-PPG signal. The CHROM technique removes components unrelated to the r-PPG signal by projecting RGB channels into a chrominance subspace. The r-PPG signal is obtained from the difference between two chrominance signals. It will be reported as r-PPGCHROM.

Plane Orthogonal to Skin (POS)
The Plane Orthogonal to Skin method projects the signals onto a plane orthogonal to the skin tone, which differs from the projection used by the CHROM method, and it is considered more robust in complex illumination scenarios [34]. As in the CHROM method, the r-PPG signal (hereinafter reported as r-PPGPOS) is obtained from the linear combination of two projection axes giving in-phase signals.
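As an illustration of the chrominance idea, the sketch below combines three raw RGB traces into a single pulse signal using the coefficients commonly reported for CHROM (X = 3R − 2G, Y = 1.5R + G − 1.5B, S = X − αY, with α the ratio of the standard deviations of X and Y). It is a simplified sketch: the band-pass filtering that normally precedes the computation of α is omitted, the normalization by the temporal mean is an assumption, and the function name `chrom_signal` is hypothetical rather than taken from the paper.

```python
import statistics


def chrom_signal(r, g, b):
    """Combine raw R, G, B traces into one chrominance-based pulse signal."""

    def normalize(x):
        # Divide by the temporal mean and remove the DC level (assumption).
        mu = statistics.fmean(x)
        return [v / mu - 1.0 for v in x]

    rn, gn, bn = normalize(r), normalize(g), normalize(b)
    # Two chrominance signals built from the normalized channels.
    xs = [3.0 * rv - 2.0 * gv for rv, gv in zip(rn, gn)]
    ys = [1.5 * rv + gv - 1.5 * bv for rv, gv, bv in zip(rn, gn, bn)]
    sd_y = statistics.pstdev(ys)
    alpha = statistics.pstdev(xs) / sd_y if sd_y > 0 else 0.0
    # The pulse signal is the alpha-weighted difference of the two.
    return [xv - alpha * yv for xv, yv in zip(xs, ys)]
```

A useful property of this combination is that a disturbance scaling all three channels by the same factor (for example, a change in overall brightness) gives X = Y and α = 1, so it cancels in the output.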

Blind Source Separation Analysis
A technique often used for noise removal from physiological signals is Blind Source Separation (BSS), which consists of reconstructing source signals from a set of observed signals. Typically, observations are acquired from the output of an array of sensors, where each sensor receives a different combination of the source signals.
There are several methods of BSS, including Independent Component Analysis (ICA) and Principal Component Analysis (PCA).
The ICA technique was first used in the field of remote photoplethysmography by Poh et al. [32] in 2010. ICA aims to decompose a linear combination of sources under the assumptions of independence and non-Gaussianity [55,56] to estimate the source signals. There are several algorithms to perform ICA, but the JADE ICA algorithm [57] is the most widely used in the field of remote photoplethysmography [9,14,15,32,33,47,48]. The signals obtained with this technique are hereinafter reported as r-PPGICA1, r-PPGICA2, and r-PPGICA3 as results of the first, second, and third ICA estimations, respectively.
Similar to ICA, the PCA technique allows the identification of the original source signals (s1(t), s2(t), and s3(t)) from a set of observed signals (x1(t), x2(t), and x3(t)) by recovering the directions along which the data have maximum variance. Its goal is to represent these data as a set of new orthogonal variables named principal components [58]. The signals obtained with PCA are hereinafter reported as r-PPGPCA1, r-PPGPCA2, and r-PPGPCA3 as results of the first, second, and third PCA estimations, respectively.
Since ICA and PCA do not return the independent and principal components in an order related to the cardiac content, the component whose power spectrum contains the highest peak was selected for further analysis [47].

Tests and Experimental Trials

Participants and Tests
In this study, eighteen volunteers (i.e., 15 males, 3 females, age range 20-39 years old, height 163-177 cm, body mass 55-83 kg) were enrolled. Videos were recorded in a 30 m² unstructured environment (i.e., a laboratory) equipped with three windows that provide natural light. All experiments were carried out indoors and with a stable amount of light delivered by neon lights. No restrictions were placed on the clothing of the subjects.
Each volunteer was asked to perform four trials while seated and breathing quietly. As reported in Table 1, we investigated two different user-camera distances (0.5 m and 1.0 m) and two illumination conditions (natural light or artificial light), which resemble circumstances that can be found in home, office, and clinical settings. A light ring (with a diameter of 48 cm and a total power of 55 W) was used to provide the additional (artificial) light source (set light temperature 3500 K). The light ring was installed on a tripod. On the same tripod, the smartphone used for capturing videos was positioned (Figure 3).
A multi-parameter wearable device, the Zephyr BioModule BioHarness 3 by Medtronic (BH3), was used to record the reference respiratory and electrocardiographic (ECG) signals simultaneously with the video recording. This system consists of a thoracic belt and an electronic module. It acquires the user's breathing pattern by sensing the volumetric changes in the thorax by means of a strain gauge and the ECG waveform via dry electrodes. The reference breathing and ECG signals were sampled at 25 Hz and 250 Hz, respectively.
The protocol used was the same for the four trials: subjects at the beginning of each trial performed a full-lung apnea lasting ~5 s to allow synchronization between the data acquired by the reference system and those acquired by the video recording system. After the apnea, subjects continued to breathe quietly until the end of the test. Subjects were asked not to wear glasses, as their presence has been shown to affect r-PPG signal detection negatively [18]. In addition, all subjects were asked not to wear make-up. At the beginning of each trial, the environmental light intensity values of the scenario were measured using a light meter (see Table 1).
Four videos of approximately 90 s each were recorded for each subject under different conditions. All the tests were carried out in compliance with the Ethical Approvals (ST-UCBM 27/18 OSS), and prior to the tests, all the participants provided their informed consent.

Data Analysis
The collected videos were post-processed in the MATLAB environment to extract both the breathing patterns and the r-PPG signals with the described post-processing techniques (Figure 4). From each trial, we retrieved the sy signal with the optical flow method and the r-PPG signals with the six implemented post-processing techniques used to retrieve the cardiac pulsatile waveform. Before data processing, all the r-PPG signals were filtered in the range 0.5 Hz-2.5 Hz (equivalent to the 30-150 bpm range) to emphasize the pulsatile component caused by the heart beating. The respiratory signals obtained from the video were also filtered, using a first-order Butterworth bandpass filter between 0.05 Hz and 2 Hz, to remove the high frequencies related to noise while preserving all the possible fR values (range 3-120 breaths·min−1). All the r-PPG signals obtained from the video were synchronized with the ECG waveform by using the respiratory traces recorded from the video (sy) and the reference breathing signal from BH3. The endpoint of the apnea was used as a synchronization point on both the sy and the reference system signal. The first 60 s of the video were considered for the analysis.
After this process, the analysis in the spectral domain was carried out to estimate the values of HR and fR for both the reference and the video signals. The Power Spectrum Density (PSD) was computed using the Lomb periodogram through a moving window of 20 s [51], sliding by 1 s. The window length was selected according to [51], in which windows with sizes between 15 s and 20 s were found to represent a good trade-off between time resolution and noise robustness.
The values of HR and fR were computed by considering the frequency at which the highest peak of the PSD occurs in each window, according to the following equation:
$$\mathrm{HR},\ f_R = 60 \cdot \underset{f}{\arg\max}\ \mathrm{PSD}(f)$$
where f is expressed in Hz and the factor 60 converts the peak frequency into bpm (for HR) or breaths·min−1 (for fR).
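The windowed spectral estimation described in this section can be sketched in Python as follows. The example uses the classical Lomb-Scargle formula evaluated on a fixed frequency grid; the grid step of 0.01 Hz, the absence of any normalization of the periodogram, and the function names are assumptions made for illustration and do not reproduce the authors' MATLAB code.

```python
import math


def lomb_power(t, y, freq_hz):
    """Classical Lomb-Scargle power of samples (t, y) at one frequency (Hz)."""
    mean_y = sum(y) / len(y)
    yc = [v - mean_y for v in y]
    w = 2.0 * math.pi * freq_hz
    # Time offset tau that makes the sine and cosine terms orthogonal.
    s2 = sum(math.sin(2.0 * w * ti) for ti in t)
    c2 = sum(math.cos(2.0 * w * ti) for ti in t)
    tau = math.atan2(s2, c2) / (2.0 * w)
    cs = [math.cos(w * (ti - tau)) for ti in t]
    sn = [math.sin(w * (ti - tau)) for ti in t]
    y_cos = sum(a * b for a, b in zip(yc, cs))
    y_sin = sum(a * b for a, b in zip(yc, sn))
    cc = sum(c * c for c in cs)
    ss = sum(s * s for s in sn)
    power = 0.0
    if cc > 0:
        power += y_cos ** 2 / cc
    if ss > 0:
        power += y_sin ** 2 / ss
    return 0.5 * power


def peak_rate_per_min(t, y, f_lo, f_hi, df=0.01):
    """Rate (per minute) at the highest periodogram peak in [f_lo, f_hi]."""
    n = int(round((f_hi - f_lo) / df)) + 1
    freqs = [f_lo + k * df for k in range(n)]
    best = max(freqs, key=lambda f: lomb_power(t, y, f))
    return 60.0 * best


def sliding_rates(t, y, f_lo, f_hi, window_s=20.0, step_s=1.0):
    """Apply peak_rate_per_min on a window of window_s sliding by step_s."""
    rates = []
    start = t[0]
    while start + window_s <= t[-1] + 1e-9:
        idx = [i for i, ti in enumerate(t) if start <= ti < start + window_s]
        rates.append(peak_rate_per_min([t[i] for i in idx],
                                       [y[i] for i in idx], f_lo, f_hi))
        start += step_s
    return rates
```

For an r-PPG trace, `sliding_rates(t, y, 0.5, 2.5)` searches the 30-150 bpm band; for the respiratory trace, the band limits would be 0.05 Hz and 2 Hz, matching the filter ranges given above.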
Figure 5 shows the temporal trends of the estimated values of fR and HR against the reference values. Some estimated HR values are inconsistent with the others, since r-PPG signals are affected by artifacts. These values are defined as outliers (Figure 6). Considering the total vector of estimated values, those falling outside the interquartile range multiplied by 1.5 were removed and replaced with the nearest non-outlier values.
Three approaches were proposed for the HR estimation:
(1) Single-ROI approach: HR values were separately estimated from the r-PPG signals extracted from the three identified ROIs (i.e., LCheek, RCheek, and FHead) for each post-processing technique.
(2) Multi-ROI approach: HR values were estimated by averaging the HR values gathered from all the ROIs, as in the following equation:
$$\mathrm{HR} = \frac{\mathrm{HR}_{LCheek} + \mathrm{HR}_{FHead} + \mathrm{HR}_{RCheek}}{3}$$
where HRLCheek, HRFHead, and HRRCheek are the HR values estimated by using the PSD analysis in each window.
(3) SNR-based approach: for each post-processing technique, the signal that better represents the pulsatile waveform was identified by evaluating the signal-to-noise ratio (i.e., SNR) according to [35]. Only the signal with the highest SNR value was used for HR estimation.
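The outlier handling and the multi-ROI averaging described in this section can be sketched as below. The outlier rule is interpreted here as the usual quartile fences (values more than 1.5 times the interquartile range below the first quartile or above the third quartile); this interpretation, the quartile method used by `statistics.quantiles`, and the function names are assumptions, not details confirmed by the text.

```python
import statistics


def replace_outliers(values, k=1.5):
    """Replace outliers with the nearest (in time) non-outlier value.

    Assumption: a value is an outlier when it lies more than k times the
    interquartile range below the first or above the third quartile.
    """
    q1, _, q3 = statistics.quantiles(values, n=4)
    lo, hi = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    good = [i for i, v in enumerate(values) if lo <= v <= hi]
    cleaned = []
    for i, v in enumerate(values):
        if lo <= v <= hi:
            cleaned.append(v)
        else:
            nearest = min(good, key=lambda j: abs(j - i))
            cleaned.append(values[nearest])
    return cleaned


def multi_roi_hr(hr_lcheek, hr_fhead, hr_rcheek):
    """Average, window by window, the HR estimates of the three face ROIs."""
    return [(a + b + c) / 3.0
            for a, b, c in zip(hr_lcheek, hr_fhead, hr_rcheek)]
```

In this sketch, each list holds one HR estimate per analysis window, so the multi-ROI output has the same length as the single-ROI inputs.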
We used Bland-Altman and correlation analysis (coefficient of determination R²) to investigate the agreement between the implemented approaches and the reference values. For each approach (i.e., single-ROI, multi-ROI, SNR-based), we carried out a separate analysis to evaluate the influence of different ambient conditions on HR and fR estimation. Firstly, we compared the two user-camera distances (i.e., 0.5 m and 1.0 m), pooling the data collected in the two lighting conditions. Then, we evaluated the influence of the two lighting conditions (i.e., natural and artificial light), pooling the data collected at 0.5 m and 1.0 m.
Additionally, for the multi-ROI and the SNR-based approaches, the Mean Absolute Error (MAE) and the Root Mean Square Error (RMSE) were computed.
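The agreement metrics named in this section can be computed with a few lines of Python. The sketch assumes paired lists of estimated and reference values and takes the limits of agreement as the conventional 1.96 standard deviations of the differences around the MOD; this convention and the function name `agreement_metrics` are assumptions rather than details stated in the text.

```python
import math
import statistics


def agreement_metrics(estimated, reference):
    """Return MOD, LOA half-width, MAE, and RMSE for paired measurements."""
    diffs = [e - r for e, r in zip(estimated, reference)]
    mod = statistics.fmean(diffs)                    # mean of differences
    loa = 1.96 * statistics.stdev(diffs)             # MOD +/- loa are the LOAs
    mae = statistics.fmean([abs(d) for d in diffs])  # mean absolute error
    rmse = math.sqrt(statistics.fmean([d * d for d in diffs]))
    return {"MOD": mod, "LOA": loa, "MAE": mae, "RMSE": rmse}
```

With this convention, a result reported as MOD ± LOAs corresponds to the pair `MOD - LOA` and `MOD + LOA` returned by the function.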

Respiratory Rate Estimation
Figure 7 reports the Bland-Altman plots for the fR values estimated through OF to evaluate the influence of the user-camera distance (Figure 7A,B) and the influence of the two lighting conditions (Figure 7C,D). The dashed line represents the Mean of Difference (MOD), and the red and blue lines represent the upper Limit of Agreement (LOA) and the lower LOA, respectively. A MOD ± LOAs of −0.05 ± 3.58 breaths·min−1 is obtained when the light ring is on, considering the data collected at both 0.5 m and 1.0 m, and a MOD ± LOAs of −0.02 ± 4.22 breaths·min−1 is achieved in the case of natural light (Figure 7B). Considering all the data collected in the two lighting conditions, a MOD ± LOAs of −0.13 ± 3.43 breaths·min−1 and a MOD ± LOAs of 0.06 ± 4.34 breaths·min−1 were obtained at distances of 0.5 m and 1.0 m, respectively.
In Table 2, MAE and RMSE values are reported for each trial by considering all the subjects' data. Independently of the user-camera distance and the lighting conditions, MAE is always below 1 breaths·min−1, and the highest RMSE is obtained at a distance of 1.0 m when the light ring is off.

Heart Rate Estimation
The influence of user-camera distance and lighting conditions on the estimation of HR was investigated by analyzing the LOAs obtained from the Bland-Altman plots, R², MAE, and RMSE. These values were computed for each implemented post-processing technique and for the three approaches followed for HR estimation (i.e., single-ROI, multi-ROI, and SNR-based approaches).

Single-ROI Approach
Table 3 reports the values of MOD ± LOAs obtained for each implemented post-processing technique and for the three identified ROIs, as well as the results obtained with the multi-ROI approach (see the following section). In the majority of cases, light-off trials showed higher LOAs. The same was true in the trials with a user-camera distance of 1 m, independently of the lighting conditions. The single-ROI approach does not allow the identification of the best region of the face for HR estimation because of conflicting results in the presence/absence of light at the two distances (first two columns) and between the two distances considering both the lighting conditions (third and fourth columns).
Table 3. MOD ± LOAs obtained in estimating HR from the three ROIs (single-ROI approach) and the average HR (obtained with the mean-ROI approach) per each implemented post-processing technique in the different ambient conditions. The best results are highlighted in bold.
Figure 8 shows the values of R² between the HR values estimated with the three approaches (i.e., single-ROI, multi-ROI, and SNR-based approach) and the reference values for each implemented post-processing technique. The R² values corroborate the results obtained with the LOAs. Except for PCA (which shows lower R² values in all the conditions), R² ranged from 0.8177 to 0.9775 in light-on conditions and was lower when the light was off for all the ROIs. At a user-camera distance of 1 m, the R² values are generally lower than at 0.5 m.
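As an illustration of how an R² value between estimated and reference HR can be obtained, the sketch below computes it as the squared Pearson correlation coefficient. This is an assumption: the text does not state the exact definition used, although for a simple linear regression with intercept the coefficient of determination coincides with the squared Pearson coefficient. The function name and the toy HR values are hypothetical.

```python
import numpy as np

def r_squared(estimated, reference):
    """Squared Pearson correlation between paired HR series (assumed R² definition)."""
    r = np.corrcoef(np.asarray(estimated, dtype=float),
                    np.asarray(reference, dtype=float))[0, 1]
    return r ** 2

# Toy HR values in bpm (illustrative only, not data from the study)
hr_ref = [60, 62, 65, 70, 72]
hr_est = [61, 62, 66, 69, 73]

r2 = r_squared(hr_est, hr_ref)
print(f"R² = {r2:.4f}")
```

R² captures how well the estimates track the reference trend, but it is insensitive to a constant bias, which is why it is read here together with the MOD and LOAs.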

Multi-ROI Approach
As anticipated, Table 3 reports the results related to the HR estimation using the multi-ROI approach. Considering the average value of HR estimated with this approach, all the implemented techniques show good agreement in the different ambient conditions, and the best agreement is achieved by the POS algorithm. To evaluate the influence of the lighting condition on the estimation of average HR, all data collected at the two distances were used (n = 1476, i.e., 41 values per volunteer × 18 volunteers × 2 distances). In all the cases (light on and off, d = 0.5 m and d = 1 m), except for ICA, the multi-ROI approach outperforms the single-ROI approach (see Figure 8 and Table 3). From Figure 9 (n = 1476 per Bland-Altman plot), the HR estimations are closer to the reference values when the light ring is on. All the implemented post-processing techniques perform better at 0.5 m (higher R² and lower LOAs), and the POS algorithm provides the best agreement with the reference values (MOD ± LOAs = 0.009 ± 2.45 bpm, R² = 0.9861). PCA performed worst among all the techniques.
Table 4 reports the values of MAE and RMSE in bpm computed for each implemented post-processing technique in the estimation of average HR in the different ambient conditions. Considering all the post-processing techniques, MAE is between 0.7 bpm and 6 bpm, and RMSE is between 1.2 bpm and 9.11 bpm.
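The idea of averaging HR across ROIs can be sketched as follows. This is an illustrative reading rather than the authors' pipeline: it assumes that HR is taken as the dominant spectral peak of each ROI signal within a 0.7–3.0 Hz band and that the multi-ROI value is the arithmetic mean of the three per-ROI estimates. The band limits, the 30 fps frame rate, the 20 s window, and the synthetic signals are all assumptions introduced for the example.

```python
import numpy as np

def hr_from_spectrum(signal, fs, band=(0.7, 3.0)):
    """Estimate HR (bpm) as the dominant spectral peak inside `band` (Hz)."""
    x = np.asarray(signal, dtype=float)
    x = x - x.mean()                                  # remove the DC component
    freqs = np.fft.rfftfreq(x.size, d=1.0 / fs)
    power = np.abs(np.fft.rfft(x)) ** 2
    mask = (freqs >= band[0]) & (freqs <= band[1])    # assumed cardiac band
    return 60.0 * freqs[mask][np.argmax(power[mask])]

def multi_roi_hr(roi_signals, fs):
    """Average the per-ROI spectral HR estimates (assumed multi-ROI rule)."""
    return float(np.mean([hr_from_spectrum(s, fs) for s in roi_signals.values()]))

# Synthetic 20 s signals at 30 fps with a 1.2 Hz (72 bpm) pulse plus noise
fs = 30.0
t = np.arange(600) / fs
rng = np.random.default_rng(0)
rois = {
    name: np.sin(2 * np.pi * 1.2 * t) + 0.3 * rng.standard_normal(t.size)
    for name in ("forehead", "left_cheek", "right_cheek")
}

hr = multi_roi_hr(rois, fs)
print(f"multi-ROI HR = {hr:.1f} bpm")
```

Averaging across ROIs reduces the weight of a single region that is degraded, for example by local shadowing, which is consistent with the multi-ROI approach outperforming each single ROI in most conditions.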

SNR-Based Approach
Considering the R² values provided in Figure 8, the estimations of the SNR-based approach present a lower correlation with the reference data than those of the multi-ROI approach. In line with the results obtained with the multi-ROI approach, HR values estimated at 0.5 m are closer to the reference values than those estimated at 1 m. Light-on data collections provided more accurate HR estimations for all the post-processing techniques (Figure 10). The values of MOD ± LOAs obtained in the estimation of HR from the r-PPG signals with the highest SNR value for each implemented post-processing technique are reported in Table 5. The data show that the POS algorithm provides the best agreement with the reference values (MOD ± LOAs = 0.05 ± 2.91 bpm, R² = 0.9805), with LOAs slightly wider than the best ones found with the multi-ROI approach (± 2.45 bpm). The POS and Green Channel techniques presented close results in terms of MOD and LOAs for all the conditions, as emerges from the bolded values in Table 5.
MAE and RMSE values strengthen this result (see Table 6). The values of MAE range from 0.60 bpm, achieved for the Green Channel at 0.5-off (0.5 m, light off), to 6.71 bpm, achieved for CHROM at 1-off (1 m, light off). Similarly, the lowest RMSE value is obtained for the Green Channel technique (RMSE = 1.27 bpm), and the highest is obtained for CHROM (RMSE = 12.98 bpm). At the larger user-camera distance, the values of MAE and RMSE increase for all the implemented post-processing techniques except for modwtmra, which shows an MAE of 1.10 bpm at 0.5-on and an MAE of 0.84 bpm at 1-on. The largest errors occur at 1-off, where CHROM reaches the maximum values (MAE = 6.71 bpm and RMSE = 12.98 bpm).
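The selection step of the SNR-based approach (estimating HR from the r-PPG signal with the highest SNR) can be sketched as below. The SNR definition is an assumption made for the example: the ratio, in dB, between the spectral power within ± 0.1 Hz of the dominant peak and the remaining power in a 0.7–3.0 Hz band. The function names, band limits, and synthetic signals are hypothetical and are not taken from the paper.

```python
import numpy as np

def snr_and_hr(signal, fs, band=(0.7, 3.0), half_width=0.1):
    """Return (SNR in dB, HR in bpm) for one r-PPG signal.

    Assumed SNR: power within +/- half_width Hz of the dominant in-band
    peak divided by the remaining in-band power.
    """
    x = np.asarray(signal, dtype=float)
    x = x - x.mean()
    freqs = np.fft.rfftfreq(x.size, d=1.0 / fs)
    power = np.abs(np.fft.rfft(x)) ** 2
    in_band = (freqs >= band[0]) & (freqs <= band[1])
    peak = freqs[in_band][np.argmax(power[in_band])]
    near_peak = in_band & (np.abs(freqs - peak) <= half_width)
    signal_power = power[near_peak].sum()
    noise_power = power[in_band & ~near_peak].sum()
    return 10.0 * np.log10(signal_power / noise_power), 60.0 * peak

def snr_based_hr(roi_signals, fs):
    """Pick the ROI with the highest SNR and return (ROI name, HR in bpm)."""
    scores = {name: snr_and_hr(sig, fs) for name, sig in roi_signals.items()}
    best = max(scores, key=lambda name: scores[name][0])
    return best, scores[best][1]

# Synthetic 20 s signals at 30 fps: same 1.1 Hz (66 bpm) pulse, different noise
fs = 30.0
t = np.arange(600) / fs
rng = np.random.default_rng(1)
noise_levels = {"forehead": 0.2, "left_cheek": 1.0, "right_cheek": 2.0}
rois = {
    name: np.sin(2 * np.pi * 1.1 * t) + sigma * rng.standard_normal(t.size)
    for name, sigma in noise_levels.items()
}

best_roi, hr = snr_based_hr(rois, fs)
print(f"selected ROI: {best_roi}, HR = {hr:.1f} bpm")
```

Unlike averaging, this rule discards the information from the two ROIs that are not selected, so a single misleading SNR value directly affects the HR estimate.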

Discussion
In this paper, a novel approach for the continuous estimation of HR and fR based on a multi-ROI approach and the spectral analysis of RGB video frames recorded with a mobile device is proposed. A built-in smartphone camera is used to record videos of the subject's face and torso, allowing the recording of chest wall movements and subtle color changes related to the blood volume pulse in a non-intrusive and low-cost manner. The respiratory signal is estimated from the motion of the chest, whereas the cardiac signal is retrieved from the pulsatile activity at the level of the right and left cheeks and the forehead. We investigated the performance in scenarios simulating daily living to evaluate the influence of environmental factors such as recording distance (i.e., 0.5 m and 1.0 m) and illumination conditions (natural or artificial light). We implemented and tested several commonly adopted post-processing techniques on video frames to (i) estimate HR and fR values through single-ROI, multi-ROI, and SNR-based approaches; (ii) carry out a continuous estimation of HR and fR values with an update time of 1 s; and (iii) investigate the influence of user-camera distance and light source on the performance of the implemented approaches.
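A continuous estimation with an update time of 1 s can be obtained by sliding an analysis window over the signal in 1 s steps and estimating the rate in each window. The sketch below illustrates the principle only: the 20 s window, the 60 s recording, the 30 fps frame rate, and the 0.7–3.0 Hz band are illustrative assumptions and are not taken from the protocol described in this section. With these assumed durations, the procedure yields (60 − 20)/1 + 1 = 41 estimates.

```python
import numpy as np

def spectral_peak_bpm(segment, fs, band=(0.7, 3.0)):
    """Rate (per minute) at the dominant spectral peak inside `band` (Hz)."""
    x = np.asarray(segment, dtype=float)
    x = x - x.mean()
    freqs = np.fft.rfftfreq(x.size, d=1.0 / fs)
    power = np.abs(np.fft.rfft(x)) ** 2
    mask = (freqs >= band[0]) & (freqs <= band[1])
    return 60.0 * freqs[mask][np.argmax(power[mask])]

def sliding_rate(signal, fs, window_s=20.0, step_s=1.0):
    """Estimate the rate in overlapping windows shifted by `step_s` seconds."""
    win = int(round(window_s * fs))    # window length in samples (assumed 20 s)
    step = int(round(step_s * fs))     # update time in samples (1 s)
    starts = range(0, len(signal) - win + 1, step)
    return np.array([spectral_peak_bpm(signal[s:s + win], fs) for s in starts])

# Synthetic 60 s pulse signal at 30 fps with a constant 1.2 Hz (72 bpm) rhythm
fs = 30.0
pulse = np.sin(2 * np.pi * 1.2 * np.arange(1800) / fs)

trend = sliding_rate(pulse, fs)
print(trend.size, trend[:3])
```

A longer window improves the frequency resolution (1/window length in Hz) at the cost of a slower response to changes in the rate.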
Considering all the trials, we obtained comparable MAE and RMSE values in the estimation of fR with the OF technique applied to the single torso ROI. Independently of the user-camera distance and the lighting conditions, MAE values are always below 1 breaths·min−1, which is in accordance with the results obtained in [50], in which breath-by-breath fR values were computed through an analysis in the time domain. Considering all data collected in the two illumination conditions, LOAs of ± 3.43 breaths·min−1 are achieved at 0.5 m. These are slightly worse than the results obtained in [43], in which LOAs of ± 1.92 breaths·min−1 were obtained for breath-by-breath fR values computed with a time-domain analysis that required the identification of inspiratory peaks and appropriate outlier exclusion techniques. However, our results are better than those obtained with other contactless video-based techniques [59], such as NIR cameras (MOD ± LOAs = 2.22 ± 8.0 breaths·min−1) or FIR cameras (MOD ± LOAs = 0.78 ± 4.24 breaths·min−1).
Regarding the HR estimation, we obtained comparable MAE and RMSE in all the trials when considering the HR computed with the multi-ROI approach. MAE is between 0.7 bpm and 6 bpm for each implemented post-processing technique, and RMSE is between 1.2 bpm and 9.11 bpm. These values are comparable with those obtained in previous studies where the average HR values are computed with a single-ROI approach (RMSE of 4.97 bpm in [28] and RMSE of 6.92 bpm in [60]). To the best of our knowledge, no studies have performed continuous monitoring with similar approaches. Considering the influence of user-camera distance and lighting conditions, in line with the results reported in previous studies, the values of MAE and RMSE increase with the distance [9] and when the light is off [12]. Comparing the HR results obtained with the classical single-ROI approach with those obtained with the multi-ROI approach, the latter are better for each implemented algorithm. When the SNR-based approach is used to estimate HR values from the r-PPG signals extracted with each implemented algorithm, MAE ranges from 0.60 bpm to 6.71 bpm and RMSE ranges from 1.27 bpm to 12.98 bpm; these values are comparable with those obtained with the multi-ROI approach in each trial. In addition, from the Bland-Altman analysis, the LOAs of each implemented post-processing technique are slightly wider than those obtained with the multi-ROI approach.

Conclusions
Despite the absence of contact with the subject, this study obtained promising results for cardiorespiratory monitoring with a smartphone's built-in camera, fostering the remote monitoring of a person's health status and of individuals who require continuous monitoring. Our results demonstrate that both the proposed multi-ROI and SNR-based approaches could reliably estimate values of HR and fR with an update time of 1 s, providing a temporal trend of these values. However, some limitations must be acknowledged, mainly the non-inclusion of subjects with different skin colors in the experimental trials and the lack of investigation of the influence of motion artifacts on the estimation of cardiorespiratory parameters. These aspects should be considered for further real-time applications, including different scenarios (e.g., during sports activities or in the automotive field). Further investigations will be devoted to evaluating the influence of movements on the quality of r-PPG signals and on the estimation of HR values. In addition, the recording and analysis of longer videos can help to evaluate the performance of the proposed method in the continuous monitoring of cardiorespiratory parameters over a wider range of HR and fR values. Further efforts will investigate the feasibility of estimating heart rate variability indexes in the time and frequency domains from signals retrieved with the SNR-based approach to better assess the physiological and psychological conditions of a person.