Enabling Continuous Wearable Reflectance Pulse Oximetry at the Sternum

In light of the recent Coronavirus disease (COVID-19) pandemic, peripheral oxygen saturation (SpO2) has been shown to be amongst the vital signs most indicative of deterioration in persons with COVID-19. To allow for the continuous monitoring of SpO2, we attempted to demonstrate accurate SpO2 estimation using our custom chest-based wearable patch biosensor, capable of measuring electrocardiogram (ECG) and photoplethysmogram (PPG) signals with high fidelity. Through a breath-hold protocol, we collected physiological data with a wide dynamic range of SpO2 from 20 subjects. The ratio of ratios (R) used in pulse oximetry to estimate SpO2 was robustly extracted from the red and infrared PPG signals during the breath-hold segments using novel feature extraction and PPGgreen-based outlier rejection algorithms. Through subject-independent training, we achieved a low root-mean-square error (RMSE) of 2.64 ± 1.14% and a Pearson correlation coefficient (PCC) of 0.89. With subject-specific calibration, we further reduced the RMSE to 2.27 ± 0.76% and increased the PCC to 0.91. In addition, we showed that calibration is more efficiently accomplished by standardizing and focusing on the duration of breath-hold rather than the resulting range in SpO2. The accurate SpO2 estimation provided by our custom biosensor and algorithms opens research opportunities for a wide range of disease and wellness monitoring applications.


Introduction
Due to the novel Coronavirus disease (COVID-19) pandemic, there is a clear need to monitor respiratory functions in outpatient settings to help assess the progression of COVID-19 during the presymptomatic, symptomatic, and recovery stages. In a recent effort to record and model the trajectories of several vital signs in hospitalized COVID-19 patients, Pimentel et al. showed that peripheral oxygen saturation (SpO2) is amongst the parameters most indicative of COVID-19 progression prior to primary outcomes, underscoring the importance of monitoring SpO2 continuously [1]. Through remote SpO2 monitoring, accurate tracking of COVID-19 progression allows for the implementation of disease-management strategies for both timely interventions and the optimization of scarce medical resources [2].
Unfortunately, existing SpO2 measurement devices are inconvenient for monitoring in outpatient settings. Typically, SpO2 is measured through pulse oximeters placed at peripheral extremities such as the fingers; however, these devices obstruct normal activities.
The PPG signal used in this work represents the changes in the intensity of the light emitted by the light-emitting diodes (LEDs) and reflected back to the photodiodes (PDs). According to the Beer-Lambert law, the intensity of the measured reflectance PPG is related to the optical path length that light travels from the LEDs to the PDs [17] (pp. 47-48). The changes in PPG intensity attributable to each component (arterial blood, venous blood, tissue, bone, etc.) have different pulsating dynamics [20]. By using appropriate filter banks, we can leverage the cardiac pulsation of the PPG to target its arterial, pulsatile, small-signal component. Specifically, the portion of the PPG signal that is representative of cardiac pulsation and the periodic changes in blood volume is termed the alternating current (AC) component, and the baseline wander of the PPG, which is slower than the cardiac frequency, is termed the direct current (DC) component [20]. The AC and DC components of the PPG in multiple wavelengths (i.e., red and IR PPG) can reveal the oxygen saturation of the underlying arteries [17]. More details are provided in Section 2.4.

Breath-Hold Study Design
The breath-hold study was designed to induce hypoxemia and sufficient changes in SpO2. This study was conducted under a protocol approved by the Georgia Institute of Technology Institutional Review Board (H21100). A total of 22 young volunteers (16 males, 6 females) were recruited for the breath-hold study, and written informed consent was obtained. The number of subjects recruited exceeds that of similar studies [14,21,22]. Two subjects were excluded from analysis. The data of one subject suffered from poor ECG quality because expired ECG electrodes were inadvertently used. The data of the other subject exhibited an abnormal distribution of the extracted features compared to those shown in [17] (p. 51); specifically, the ratio of ratios (R) systematically deviated by more than three standard deviations across all SpO2 levels. Therefore, only the data of the remaining 20 subjects were used for analysis. Demographic information of these 20 subjects, including age, weight, height, Fitzpatrick skin type, perfusion indices (PI), etc., is summarized in Table 1. Note that the distribution of PI for red and infrared in this dataset falls well below the poor perfusion threshold (0.3%) as defined by the Food and Drug Administration (FDA) [16], suggesting this measurement site is indeed malperfused.
In the breath-hold study, subjects were first asked to shave their chest hair to reduce interference. Subsequently, each subject performed 10 end-expiratory breath-holds while sitting in an upright posture, with a one-minute break between breath-holds. One minute was found to be sufficiently long for SpO2 to return to its baseline level. Subjects were instructed to hold their breath for as long as possible. Throughout the study, subjects wore a nose clip and held the disposable mouthpiece (AFT36 bacteriological filter; Biopac System Inc., Santa Barbara, CA, USA) between their lips. After the data were collected, important oxygenation/deoxygenation events were manually labeled.
As depicted in Figure 1, we collected the following information: ECG (Biopac ECG100A; Biopac System Inc., Santa Barbara, CA, USA), right index finger SpO2 (Biopac OXY100E, TSD124A Finger Clip Transducer; Biopac System Inc.), and respiratory flow (TSD117A Medium Flow Pneumotach Transducer; Biopac System Inc.) data, all sampled at 2000 Hz. We used the 3M™ Red Dot™ ECG electrodes (model 2660; 3M, Saint Paul, MN, USA) throughout the study. The Biopac OXY100E module reports an accuracy of ±2% for an SpO2 range of 70-100%. Data outside of this SpO2 range were discarded since the accuracy is unknown.
Figure 1. Illustration of a subject undergoing the breath-hold study. Placements of the wearable patch biosensor (left side) and the ground truth Biopac sensors (right side) are depicted. The relative size of the biosensor is shown with an off-the-shelf ECG electrode and a penny. Note that the photodiode (PD) has an area of 4.5 mm². PD1 is on the left side and PD2 is on the right side of the subject.
In parallel, we also attached the wearable patch biosensor to the subject's mid-sternum and collected single-lead ECG, two sets of multiwavelength PPGs (red, infrared [IR], and green), and triaxial seismocardiogram (SCG, not used in this study), sampled at 500, 67, and 1000 Hz, respectively. The hardware used in the biosensor is almost identical to that reported in our previous work [8,10,23] except for the addition of the PPG modules and the change in form factor. The ECG analog front-end (AFE) and the accelerometer AFE (for SCG) remain the same. The PPG AFE used to drive the LEDs and obtain data from the PDs is the Maxim 86170 (Maxim Integrated, San Jose, CA, USA). The multi-chip LEDs, which have red (660 nm), IR (950 nm), and green (526 nm) wavelengths, are the SFH 7016 (OSRAM, Munich, Germany), and the PDs are the VEMD 8080 (Vishay Semiconductors, Heilbronn, Baden-Württemberg, Germany). Serial Peripheral Interface was used as the communication protocol between the microcontroller and peripheral sensors. This device is also equipped with wireless capabilities (i.e., Bluetooth and Wi-Fi) for transmitting data. However, in this study, data were stored on the Secure Digital card and later retrieved by a custom-built software application as in previous work [8,10,23]. The battery life of the device at the full sample rates of all sensors is up to 60 h. The front and lateral views of the device are shown in Figure 2.
Figure 2. From the frontal view, the wearable patch biosensor is attached to the sternum of the subject using ECG electrodes. The superior end of the device starts approximately two fingers down from the suprasternal notch. From the lateral view, the protruded part of the device that houses the LEDs and PDs is visible. Note that light is hardly visible from the sides of the device.

Manual Labeling
In Figure 3, filtered, high-quality physiological signals acquired during breath-hold and breathing are presented. For each subject, we selected the photodiode (PD) with higher signal quality as determined by visual inspection. The discrepancy in quality between the two PDs can be attributed to differences in LED/PD separation distance, which can affect the quality of PPG [17] (p. 88). Further assessment may be needed since optimizing the LED/PD separation distance is a critical factor for obtaining a good quality signal. Manual labeling was performed using the respiratory flow and the SpO2 data. Alignment was necessary since it has been observed that deoxygenation events do not occur simultaneously at different body sites, and SpO2 measured at the finger usually lags SpO2 measured at central sites [4,5,24]. This delay can be partially attributed to the oxygen-conserving effect induced by breath-hold. Similar to the diving response, breath-hold leads to bradycardia and peripheral vasoconstriction to reduce oxygen consumption in the peripheries and redistribute blood flow to vital organs such as the brain and the heart [6]. The combined effect leads to a delayed deoxygenation measurement by a finger-pulse oximeter when compared to a pulse oximeter placed closer to the heart or the brain. Davies et al. reported a mean delay of 16.75 ± 5.88 s across subjects for their in-ear reflectance pulse oximetry [4].
From the respiratory flow (top signal in Figure 3), breathing (the "oscillating" part, pink) and breath-hold (the "silent" part, blue) segments can be easily distinguished. From the ground truth SpO2 (the second signal on the left in Figure 3), three distinct timestamps were recorded for each deoxygenation event: start, nadir, and end. The start of deoxygenation is defined as the point where SpO2 begins to drop drastically (rate of SpO2 decline > 0.5%/cardiac cycle for 3 consecutive cardiac cycles). The nadir of deoxygenation is defined as the lowest SpO2 within the deoxygenation event. The end of deoxygenation is defined as the point where SpO2 returns to the baseline level. Usually, the nadir and the end of deoxygenation can be easily identified. To account for the delay between finger and chest deoxygenation, the nadir of deoxygenation at the finger is aligned to the end of the breath-hold, based on the assumption that chest arteries receive well-oxygenated blood immediately following the end of the breath-hold. Though our results suggest this alignment procedure is somewhat accurate, we found that a more precise alignment algorithm was required to achieve adequate accuracy; the updated alignment algorithm is applied and provided in Appendix A.
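The start-of-deoxygenation criterion can be illustrated with a short sketch. Labeling in this study was manual, so the following is only an illustration of the stated rule, assuming SpO2 has already been reduced to one value per cardiac cycle; the function name `find_deoxygenation_start` and the sample values are hypothetical.

```python
def find_deoxygenation_start(spo2_per_beat, rate_threshold=0.5, n_consecutive=3):
    """Return the index of the beat where SpO2 starts to fall by more than
    rate_threshold (% per cardiac cycle) for n_consecutive cycles, or None."""
    run = 0
    for i in range(1, len(spo2_per_beat)):
        drop = spo2_per_beat[i - 1] - spo2_per_beat[i]
        run = run + 1 if drop > rate_threshold else 0
        if run == n_consecutive:
            return i - n_consecutive  # beat at which the sustained drop begins
    return None

# Hypothetical per-beat SpO2 values (%), for illustration only.
spo2 = [98.0, 98.0, 97.8, 97.0, 96.2, 95.1, 94.0]
start = find_deoxygenation_start(spo2)  # index of the 97.8 sample
```

The nadir could be located in the same spirit as the minimum of the per-beat values within the labeled event.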

Principle of Pulse Oximetry
To relate the aligned signals to ground truth SpO2, relevant features in the biosensor signals need to be extracted. We followed the standard approach described in [17] (p. 131) and tailored the algorithm to our pulse oximetry. The key feature, R, defined as the ratio of the normalized AC components (each itself a ratio) of two optical wavelengths, can be extracted from the PPG signals:
$$R = \frac{AC_{red}/DC_{red}}{AC_{IR}/DC_{IR}} \tag{1}$$
where $AC_{red}$ is the AC component of the red PPG, $AC_{IR}$ is the AC component of the IR PPG, $DC_{red}$ is the DC component of the red PPG, and $DC_{IR}$ is the DC component of the IR PPG. Normalization is performed by dividing the AC component of a wavelength by its DC component. R, along with the absorption coefficients of oxyhemoglobin (HbO2) and deoxyhemoglobin (Hb) for different wavelengths, can be used together to derive SaO2 directly. According to [17] (p. 50), the theoretical relationship between SaO2 and R is defined as:
$$SaO_2 = \frac{\varepsilon_{Hb}(\lambda_{red}) - \varepsilon_{Hb}(\lambda_{IR})\,R}{\varepsilon_{Hb}(\lambda_{red}) - \varepsilon_{HbO_2}(\lambda_{red}) + \left[\varepsilon_{HbO_2}(\lambda_{IR}) - \varepsilon_{Hb}(\lambda_{IR})\right] R} \tag{2}$$
where $\varepsilon_{Hb}$ is the absorption coefficient of Hb, $\varepsilon_{HbO_2}$ is the absorption coefficient of HbO2, $\lambda_{red}$ is the wavelength of the red PPG, and $\lambda_{IR}$ is the wavelength of the IR PPG. If further approximated using a Taylor series expansion,
$$SpO_2 = \frac{A - B\,R}{C - D\,R} \approx m\,R + b \tag{3}$$
emerges as an empirical model that governs the relationship between SpO2, the surrogate of SaO2, and R. In Equation (3), A, B, C, and D replace the absorbance coefficients of the Hb and HbO2 at the two wavelengths [17] (p. 54), m is the slope, and b is the intercept.
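As a small worked illustration of Equation (1) and the linear form of Equation (3), the sketch below computes R from per-beat AC and DC features and maps it to SpO2. The feature values and the coefficients m and b are hypothetical placeholders, not the values fitted in this work.

```python
def ratio_of_ratios(ac_red, dc_red, ac_ir, dc_ir):
    """Equation (1): ratio of the normalized AC components of red and IR PPG."""
    return (ac_red / dc_red) / (ac_ir / dc_ir)

def spo2_linear(r, m, b):
    """Linear empirical model of Equation (3): SpO2 = m * R + b."""
    return m * r + b

# Hypothetical feature values and coefficients, for illustration only.
r = ratio_of_ratios(ac_red=0.002, dc_red=1.0, ac_ir=0.004, dc_ir=1.0)
spo2 = spo2_linear(r, m=-25.0, b=110.0)
```

With a negative slope, a larger R maps to a lower SpO2, which is the direction expected when relative red absorption increases during deoxygenation.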

Preprocessing Overview
Figure 4a illustrates our signal-preprocessing pipeline, which we used to extract R from the wearable patch biosensor signals. Although there are other signals in the dataset, only six are relevant to this work: finger SpO2, red PPG, IR PPG, green PPG, ECG, and respiratory flow. The Biopac and biosensor signals were first resampled to 500 Hz and synchronized by maximizing the cross-correlation of their ECG signals. The Biopac SpO2 was further aligned to the biosensor signals on a per-breath-hold basis, using the manual labels described in Section 2.3. Due to both respiratory artifacts when emerging from breath-hold and the lack of range in measured SpO2 values, we only target SpO2 estimation during the breath-hold segments of the signals. Each R extracted from the breath-hold segments was paired with the manually aligned SpO2; a scatter plot demonstrating the correlation between R and SpO2, and distribution plots showing the skewness of R and SpO2, are depicted in Figure 4a. The skewness of R and SpO2 can be partially attributed to our ability to maintain oxygenation homeostasis, enabled by the continuous supply of oxygen from the oxygen stores upon breath-hold [6]. Before uncovering the relationship between R and SpO2, we first demonstrate the robust feature extraction and outlier rejection algorithms necessary to extract R for the chest-based pulse oximetry.
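The synchronization step can be sketched as a search for the lag that maximizes the cross-correlation between the two ECG recordings. This is a minimal illustration, assuming both signals are already resampled to a common rate; the function name `best_lag`, the search window, and the toy spike trains are hypothetical.

```python
def best_lag(reference, signal, max_lag):
    """Return the lag (in samples) by which `signal` trails `reference`,
    found by maximizing their cross-correlation over [-max_lag, max_lag]."""
    def xcorr(lag):
        total = 0.0
        for n, ref in enumerate(reference):
            m = n + lag
            if 0 <= m < len(signal):
                total += ref * signal[m]
        return total
    return max(range(-max_lag, max_lag + 1), key=xcorr)

# Toy "ECG" spike trains; the second recording is delayed by 3 samples.
ecg_a = [0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0]
ecg_b = [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0]
lag = best_lag(ecg_a, ecg_b, max_lag=4)
```

Once the lag is known, one recording can be shifted by that many samples before any beat-level processing. The search window should stay below one beat interval when the clocks are roughly aligned, otherwise neighboring beats can produce ambiguous maxima.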

Robust Feature Extraction via Linear Transformation
To compute R, it is necessary to compute the AC and DC features of each PPG beat. In the context of this work, a feature is a scalar value that characterizes a PPG beat. The block diagram for beat segmentation and feature extraction is shown in Figure 4b. To isolate the AC component, an empirically validated bandpass filter with a passband of 0.35 to 4 Hz was first applied. The low cutoff was chosen to remove the baseline wander due to involuntary respiratory movement. The high cutoff was chosen empirically to reduce the dicrotic notch and preserve only the frequency components with less variation across wavelengths. The AC component was segmented into PPG beats using ECG R-peaks detected by a Pan-Tompkins algorithm [25], modified for R-peak correction, and further smoothed using 4-beat ensemble averaging. Although we were able to minimize respiratory artifacts through the breath-hold protocol, involuntary respiratory movements were still present and observable in some subjects, and extracting R robustly from respiration-corrupted PPG can be challenging [12,26]. Conventionally, the peak and valley of the PPG in each cardiac cycle are extracted to compute the AC features for red and IR PPG [17] (pp. 129-130). In a preliminary analysis, we found that this method is not reliable, as the signal can be easily distorted by the subtle, yet still significant, involuntary respiratory movements at this low perfusion site. To address this, we introduced a novel algorithm that does not require peak and valley extraction. Specifically, by rearranging the terms in Equation (1):
$$R = \frac{AC_{red}}{AC_{IR}} \cdot \frac{DC_{IR}}{DC_{red}} = AC_{red/IR} \cdot \frac{DC_{IR}}{DC_{red}}$$
Figure 4. (a) Extracted pairs of ratio of ratios (R) and SpO2 were used for training and calibration of the model. $T_{i,j}$ denotes the delay found between R and SpO2 for the jth breath-hold of subject i. The aligned data are displayed in the upper right scatter plot. The distributions of R and SpO2 are also shown above and on the right of the scatter plot, respectively. $IBI_{mean}$ denotes the mean of the interbeat intervals (IBI) of a subject. (b) The block diagram for PPG beat segmentation, $AC_{red/IR}$ extraction, DC feature extraction, and PPGgreen-based outlier rejection.
By computing $AC_{red/IR}$, the ratio of $AC_{red}$ to $AC_{IR}$, directly, we can avoid the difficulty of extracting peaks and valleys in distorted PPG signals. To do so, we leveraged the fact that the IR PPG beat, denoted as $PPG_{IR}$, appears to have a similar morphology to the red PPG beat, denoted as $PPG_{red}$, after being bandpass filtered. Therefore, we can model the relationship of the two PPG beats using a linear transformation method:
$$\left[\alpha_1, \alpha_2\right] = \arg\min_{\alpha_1, \alpha_2} \left\lVert PPG_{red} - \left(\alpha_1\, PPG_{IR} + \alpha_2\right) \right\rVert_2$$
where $PPG_{red}, PPG_{IR} \in \mathbb{R}^N$, N is the number of samples in the PPG beat, and $\alpha_1$, $\alpha_2$ denote the scale and the bias that minimize the $\ell_2$-norm of their differences. With the assumption that the differences, once optimized, should be closely distributed, we rejected beats with differences of more than five median absolute deviations from the median, which is a more robust rejection criterion than the "standard deviations around the mean" method [27]. Note that only 1.79% of the beats were rejected using this method. The optimal scale, $\alpha_1$, represents the ratio of the AC components of the two wavelengths:
$$AC_{red/IR} = \alpha_1$$
In parallel, the DC component was isolated using a low-pass filter with a cutoff frequency of 0.1 Hz. This cutoff was based on a heuristic assumption, informed by data shown in [28][29][30], that physiological dynamics faster than 0.1 Hz (e.g., involuntary respiratory movement) do not directly relate to the deoxygenation induced by breath-hold. The DC component was similarly segmented and smoothed to ensure consistency with the processing steps for AC extraction. Finally, DC features were computed as the mean of the segmented DC beats.
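The linear transformation step and the median-absolute-deviation rejection can be sketched as follows. This is a minimal sketch, assuming each beat is a list of bandpass-filtered samples of equal length for red and IR; the function names `fit_scale_and_bias` and `mad_outliers` and the sample values are hypothetical, and the closed-form least-squares solution stands in for whatever solver was used in this work.

```python
from statistics import mean, median

def fit_scale_and_bias(ppg_ir, ppg_red):
    """Least-squares fit of ppg_red ~ a1 * ppg_ir + a2.
    Returns (a1, a2, residual_norm); a1 plays the role of AC_red/IR."""
    mx, my = mean(ppg_ir), mean(ppg_red)
    sxx = sum((x - mx) ** 2 for x in ppg_ir)
    sxy = sum((x - mx) * (y - my) for x, y in zip(ppg_ir, ppg_red))
    a1 = sxy / sxx
    a2 = my - a1 * mx
    resid = sum((y - (a1 * x + a2)) ** 2 for x, y in zip(ppg_ir, ppg_red)) ** 0.5
    return a1, a2, resid

def mad_outliers(values, n_mads=5.0):
    """Flag values more than n_mads median absolute deviations from the median."""
    med = median(values)
    mad = median([abs(v - med) for v in values])
    return [abs(v - med) > n_mads * mad for v in values]

# Hypothetical beat: the red beat is a scaled and shifted copy of the IR beat.
ir_beat = [0.0, 1.0, 0.0, -1.0]
red_beat = [0.1, 0.6, 0.1, -0.4]
a1, a2, resid = fit_scale_and_bias(ir_beat, red_beat)

# Hypothetical residual norms of five beats; the last one is rejected.
flags = mad_outliers([1.0, 1.1, 0.9, 1.0, 5.0])
```

Because every sample of the beat contributes to the fit, a local distortion near the peak or valley has less influence on the scale than it would on a peak-to-valley amplitude.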

PPGgreen-Based Outlier Rejection
Although we carefully selected the parameters of the preprocessing and feature extraction pipeline, some PPG beats may still be distorted by motion artifacts and involuntary respiratory movements and can therefore hinder accurate SpO2 estimation. Hence, to reduce contamination by abnormal features, we designed a novel outlier rejection algorithm that uses the green PPG beats as a signal quality template, chosen for their robustness against noise [31]. Our signal quality assessment relied on two assumptions: (1) reliable red or IR PPG beats within the filtered bandwidth have a morphology similar to that of green PPG beats; and (2) outliers in AC ratios are defined as datapoints that deviate by more than five median absolute deviations from the median (similar to the $AC_{red/IR}$ rejection method). To determine the similarity, we consider a methodology described in [32]. First, the normalized cross-correlation (NCC) between a PPG beat and its corresponding template is computed:
$$NCC_{k,\lambda} = \frac{\sum_{n}\left(PPG_{\lambda}(n) - \overline{PPG_{\lambda}}\right)\left(PPG_{green}(n-k) - \overline{PPG_{green}}\right)}{\sqrt{\sum_{n}\left(PPG_{\lambda}(n) - \overline{PPG_{\lambda}}\right)^2 \sum_{n}\left(PPG_{green}(n-k) - \overline{PPG_{green}}\right)^2}}$$
where $PPG_{\lambda}(n)$ denotes the nth sample in the PPG beat of wavelength λ, $\overline{PPG_{\lambda}}$ denotes the average value of the samples in a PPG beat of wavelength λ, and $NCC_{k,\lambda}$ denotes the correlation coefficient between $PPG_{\lambda}$ and the k-lag $PPG_{green}$. Next, the maximal $NCC_{\lambda}$, $NCC_{max,\lambda}$, defined as
$$NCC_{max,\lambda} = \max_{k} NCC_{k,\lambda}$$
is selected as a measure of the signal quality index (SQI) of $PPG_{\lambda}$ and has a range of [0, 1]. Both assumptions translate directly to the two upper right blocks in Figure 4b. Each SQI method has an empirically determined threshold, 0.7, and a sample is excluded if either SQI method suggests so. The PPGgreen-based outlier rejection algorithm rejected 8.07% of the beats.
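The template-matching idea can be sketched with a lagged normalized cross-correlation. This is a minimal sketch under stated assumptions: beats are equal-length lists of filtered samples, the lag search window `max_lag` is a hypothetical parameter, and means are taken over the overlapping samples at each lag, which is a simplification of the per-beat means in the definition above. The function names and sample beats are illustrative.

```python
from math import sqrt

def ncc(beat, template, lag):
    """Normalized cross-correlation between `beat` and `template` delayed by
    `lag` samples, computed over the overlapping samples only."""
    pairs = [(beat[n], template[n - lag])
             for n in range(len(beat)) if 0 <= n - lag < len(template)]
    xs, ys = zip(*pairs)
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in pairs)
    den = sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))
    return num / den if den > 0 else 0.0

def sqi_pass(beat, green_beat, max_lag=1, threshold=0.7):
    """Return (accepted, ncc_max) using the maximal NCC over lags -max_lag..max_lag."""
    ncc_max = max(ncc(beat, green_beat, k) for k in range(-max_lag, max_lag + 1))
    return ncc_max >= threshold, ncc_max

# Hypothetical beats: one red beat follows the green template, one is inverted.
green = [0.0, 1.0, 3.0, 2.0, 1.0, 0.0]
good_red = [0.0, 2.0, 6.0, 4.0, 2.0, 0.0]
bad_red = [0.0, -1.0, -3.0, -2.0, -1.0, 0.0]
good_ok, good_ncc = sqi_pass(good_red, green)
bad_ok, bad_ncc = sqi_pass(bad_red, green)
```

The lag window should stay small relative to the beat length so that enough samples overlap at every lag.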

Computation of R
The output matrix in Figure 4b has a dimension of $N_{beats}$ × 5, where the five columns represent the AC feature ($AC_{red/IR}$), two DC features ($DC_{red}$, $DC_{IR}$), and the two binary SQI decisions ($SQI_{red}$, $SQI_{IR}$). Only features approved by the SQI algorithm were used to compute R. Note that we also experimented with the peak and valley method, but it would require a more aggressive outlier rejection threshold (~30% rejection ratio) to achieve comparable SpO2 estimation accuracy. In this dataset, R is a unitless measure and generally ranges from 0.4 to 1.6 for SpO2 above 70%.

Linear Regression
The temporally aligned SpO2 and the extracted R were subsequently used to train the parameters in Equation (3). The parameters m (slope) and b (intercept) were estimated by minimizing the $\ell_2$-norm of the difference between the ground truth SpO2 and the estimated SpO2:
$$\left(m^{*}, b^{*}\right) = f(x) = \arg\min_{m,\,b} \left\lVert SpO_2 - \left(m\,R + b\right) \right\rVert_2$$
where x denotes pairs of SpO2 and R, and f represents an arbitrary function for determining the optimal parameters of an objective function.
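For a two-parameter linear model, this minimization has a closed-form ordinary least-squares solution, sketched below. The function name `fit_linear` and the data points are hypothetical; the points are constructed to lie exactly on a line so the recovered parameters are easy to check.

```python
def fit_linear(r_values, spo2_values):
    """Ordinary least squares for SpO2 ~ m * R + b; returns (m, b)."""
    n = len(r_values)
    mean_r = sum(r_values) / n
    mean_s = sum(spo2_values) / n
    srr = sum((r - mean_r) ** 2 for r in r_values)
    srs = sum((r - mean_r) * (s - mean_s) for r, s in zip(r_values, spo2_values))
    m = srs / srr
    b = mean_s - m * mean_r
    return m, b

# Hypothetical (R, SpO2) pairs lying on SpO2 = -25 R + 110.
r = [0.5, 0.7, 0.9, 1.1]
spo2 = [97.5, 92.5, 87.5, 82.5]
m, b = fit_linear(r, spo2)
```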

Training and Calibration Schemes
Since including a one-time, short calibration procedure is realistic for practical usage of the device, we also investigated the best training and subject-specific calibration procedure. Three training and calibration schemes were considered: (1) a globalized scheme containing subject-independent training (see Figure 5a); (2) a semi-globalized scheme featuring global training with subject-specific calibration (see Figure 5b); and (3) a subject-specific scheme (see Figure 5c). The globalized scheme is equivalent to standard leave-one-subject-out (LOSO) cross-validation. The semi-globalized scheme described herein is analogous to the semi-globalized method discussed in [33], aside from the fact that we used duration rather than number of points to standardize the subject-specific calibration. In particular, in the semi-globalized scheme the globally trained intercept b was replaced by a subject-specific calibrated b (using the data in the first of the 10 breath-holds). The subject-specific scheme involved training both parameters using only the first breath-hold data of the subject. Note that we also explored calibrating m, but the results were considerably worse and are therefore not reported. In both the globalized and semi-globalized schemes, LOSO cross-validation was used to assess the generalizability of the trained models. To compare model performance fairly and to avoid data leakage, we also excluded the first breath-hold of the test subjects when evaluating the globalized scheme, ensuring identical testing data across schemes.
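The semi-globalized scheme can be sketched as a leave-one-subject-out loop in which the slope comes from the other subjects and the intercept is re-estimated from the test subject's first breath-hold. This is a minimal sketch under stated assumptions: the data layout (a dictionary of subjects, each a list of breath-holds of (R, SpO2) pairs), the function names, and the synthetic subjects are all hypothetical.

```python
from math import sqrt

def fit_linear(pairs):
    """Ordinary least squares for SpO2 ~ m * R + b over (R, SpO2) pairs."""
    n = len(pairs)
    mean_r = sum(r for r, _ in pairs) / n
    mean_s = sum(s for _, s in pairs) / n
    srr = sum((r - mean_r) ** 2 for r, _ in pairs)
    srs = sum((r - mean_r) * (s - mean_s) for r, s in pairs)
    m = srs / srr
    return m, mean_s - m * mean_r

def calibrate_intercept(m, pairs):
    """With the slope fixed, the least-squares intercept is the mean of SpO2 - m * R."""
    return sum(s - m * r for r, s in pairs) / len(pairs)

def loso_semi_globalized(data):
    """data maps subject id -> list of breath-holds, each a list of (R, SpO2) pairs.
    Returns per-subject RMSE on the remaining breath-holds after calibrating b
    on the first breath-hold of the held-out subject."""
    results = {}
    for test_id, holds in data.items():
        train = [p for sid, hs in data.items() if sid != test_id for h in hs for p in h]
        m, _ = fit_linear(train)                  # global slope from the other subjects
        b = calibrate_intercept(m, holds[0])      # subject-specific intercept
        test = [p for h in holds[1:] for p in h]  # the first breath-hold is never tested
        results[test_id] = sqrt(sum((s - (m * r + b)) ** 2 for r, s in test) / len(test))
    return results

def make_subject(b):
    """Hypothetical subject with two breath-holds following SpO2 = -25 R + b."""
    return [[(r, -25.0 * r + b) for r in rs] for rs in ([0.5, 0.7, 0.9], [0.6, 0.8, 1.0])]

data = {"A": make_subject(110.0), "B": make_subject(112.0), "C": make_subject(108.0)}
rmse_by_subject = loso_semi_globalized(data)
```

In this synthetic case the subjects share a slope and differ only in intercept, so calibrating b alone removes the error; real data would leave a residual.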

Evaluating Model Performance
To assess the performance of these three schemes, we recorded the RMSE and the parameters of the linear model on a per-subject basis, and the Pearson correlation coefficient (PCC) of the estimated SpO2 on all subjects jointly. The mean and the standard deviation of the subject-specific RMSEs were computed to summarize the performance of each scheme and were subsequently used as the critical metric to assess the capability of the pulse oximetry. Note that the errors presented in this work are all absolute errors rather than relative/percentage errors. The unit of RMSE is denoted by %, which represents the oxygen saturation level.
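These metrics can be sketched as follows: RMSE per subject, summarized as mean ± standard deviation, and PCC over the pooled estimates. The function names and the reference/estimate values are hypothetical, for illustration only.

```python
from math import sqrt
from statistics import mean, stdev

def rmse(reference, estimate):
    """Root-mean-square error in SpO2 units (%), i.e., an absolute error."""
    return sqrt(mean([(e - r) ** 2 for r, e in zip(reference, estimate)]))

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return num / den

# Hypothetical (reference, estimate) SpO2 values for two subjects.
subjects = {
    "s1": ([95.0, 90.0, 85.0], [96.0, 89.0, 85.0]),
    "s2": ([97.0, 92.0], [97.0, 94.0]),
}
per_subject_rmse = [rmse(ref, est) for ref, est in subjects.values()]
summary = (mean(per_subject_rmse), stdev(per_subject_rmse))  # mean and SD across subjects

# PCC is computed on all subjects jointly (pooled samples).
pooled_ref = [v for ref, _ in subjects.values() for v in ref]
pooled_est = [v for _, est in subjects.values() for v in est]
pcc = pearson(pooled_ref, pooled_est)
```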

Accuracy of SpO2 Estimation
In Figure 6, regression plots and Bland-Altman plots are provided to demonstrate the estimation results. We also summarize the RMSEs across subjects, PCC, bias, and 95% limits of agreement (LoA) in Table 2. The globalized scheme achieves the lowest accuracy (see Figure 6a,b). The semi-globalized scheme shows better accuracy (see Figure 6c,d). The subject-specific scheme achieves the best accuracy (see Figure 6e,f). Using the semi-globalized model, we were able to lower the mean RMSE by 0.36% and increase the PCC by 0.02 when compared to the globalized model. The semi-globalized scheme and the subject-specific scheme have similar performance levels, both of which are superior to the globalized scheme. From the Bland-Altman plots, both models show minimal bias.
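The bias and 95% limits of agreement reported in a Bland-Altman analysis can be computed as in the sketch below, using the common convention of bias ± 1.96 standard deviations of the differences (assumed here, since the exact convention is not stated in this section). The function name and sample values are hypothetical.

```python
from statistics import mean, stdev

def bland_altman(reference, estimate):
    """Return (bias, lower_loa, upper_loa) of estimate - reference,
    with limits of agreement at bias +/- 1.96 sample standard deviations."""
    diffs = [e - r for r, e in zip(reference, estimate)]
    bias = mean(diffs)
    spread = 1.96 * stdev(diffs)
    return bias, bias - spread, bias + spread

# Hypothetical reference and estimated SpO2 values (%).
reference = [95.0, 92.0, 88.0, 85.0]
estimate = [96.0, 91.0, 89.0, 84.0]
bias, lower, upper = bland_altman(reference, estimate)
```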

Semi-Globalized Scheme vs. Subject-Specific Scheme
Since it has not been previously examined in the literature, we also studied which parameters benefit the most from subject-specific calibration. This was accomplished by comparing the semi-globalized scheme (i.e., calibrating b) to the subject-specific scheme (i.e., calibrating both m and b) while varying the calibration duration constraint. The duration constraint was imposed by considering only data within the said duration. Surprisingly, the semi-globalized model works more efficiently at reducing RMSE, as shown in Figure 7. Note that two outlier subjects were excluded for better visualization. Under all three duration constraints (10 s, 20 s, and 30 s), the semi-globalized scheme achieved a lower RMSE.
When comparing the RMSE of calibrating b with a calibration duration of 10 s to that of calibrating both parameters with a calibration duration of 30 s, we found no statistically significant difference (p > 0.05) as determined by a paired sample t-test. Therefore, we determined that the semi-globalized scheme is the best calibration strategy for this dataset, as it requires a shorter duration to achieve performance similar to that of the subject-specific scheme.

Standardizing Subject-Specific Calibration: Duration vs. SpO 2 Range
To study the most efficient way to collect data for calibration, we also examined the changes in RMSE by imposing different constraints on the calibration data, including a duration constraint and SpO 2 range constraint. Similar to the duration constraint, the SpO 2 range constraint considers data only within the said SpO 2 range. According to the results shown in Figure 8a, we found that increasing calibration duration from 1 s to 20 s while fixing SpO 2 range to 30% leads to significant (p < 0.05) reduction in mean RMSE across the subjects. On the other hand, the results shown in Figure 8b suggest that increasing the calibration SpO 2 range from 1% to 20%, while fixing the duration to 30 s, did not lead to a significant (p > 0.05) difference in mean RMSE. Paired sample t-tests were used for the statistical analysis. Standardizing calibration duration appears to be the best calibration strategy here.
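The two constraints described above can be sketched as simple filters over the calibration data. This is a minimal illustration under stated assumptions: each sample is a hypothetical tuple of (time in seconds, R, reference SpO 2), the duration is measured from the first sample, and the SpO 2 range is measured downward from the highest reference value. The exact definitions used in our analysis may differ.

```python
def within_duration(samples, max_duration_s):
    """Keep samples recorded within max_duration_s of the first sample."""
    t0 = samples[0][0]
    return [s for s in samples if s[0] - t0 <= max_duration_s]

def within_spo2_range(samples, max_range_pct):
    """Keep samples whose reference SpO2 lies within max_range_pct of the maximum."""
    top = max(s[2] for s in samples)
    return [s for s in samples if top - s[2] <= max_range_pct]

# Hypothetical breath-hold data: (time_s, R, reference SpO2 in %).
samples = [
    (0, 0.40, 99.0),
    (10, 0.45, 98.0),
    (20, 0.55, 95.0),
    (30, 0.70, 91.0),
]

# Vary one constraint while holding the other fixed, as in Figure 8.
by_duration = within_duration(samples, 20)
by_range = within_spo2_range(samples, 1.0)
both = within_spo2_range(within_duration(samples, 20), 1.0)
```

Because SpO 2 falls slowly at the start of a breath-hold, a narrow range constraint can still admit many seconds of data, which is one way to see why the two constraints are examined separately.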

Effect of Varying Melanin Content
Since none of the subjects had nail polish on their right index finger or tattoos on their sternum, we only considered the confounding effect of the difference in melanin content. In this analysis, we assessed melanin levels using self-reported Fitzpatrick skin types [34] and studied the way melanin content affects the bias between finger and sternum SpO2. In Figure 9, the differences in error between Fitzpatrick skin types are shown to be statistically insignificant (p > 0.05), using a one-way analysis of variance (ANOVA). This implies that our device does not introduce different bias for subjects with varying skin melanin content when compared to the finger pulse oximeter, which is understandable, as both operate on the same principles. However, it is worth noting that there are five subjects for skin type I, nine subjects for skin type II, three subjects for skin type III, three subjects for skin type IV, and zero subjects for skin types V and VI. Due to the limited sample size and lack of data for the darkest Fitzpatrick skin types, the results attained here may not provide meaningful insight into the true accuracy of the proposed chest-based pulse oximetry on persons of all melanin levels. Note that in [35,36], it was reported that melanin content leads to SpO2 overestimation at low SaO2. Further investigation is required to study the way melanin content affects the accuracy of the chest-based pulse oximetry at various SaO2 levels.
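The one-way ANOVA used for this comparison can be sketched as below. The function name and the example values are hypothetical; only the F statistic and its degrees of freedom are computed, and the p-value would be read from an F distribution.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of groups of values.

    Hypothetical helper: each group would hold the per-subject bias for one
    Fitzpatrick skin type.
    """
    k = len(groups)
    n = sum(len(g) for g in groups)
    if k < 2 or any(len(g) == 0 for g in groups) or n <= k:
        raise ValueError("need at least 2 non-empty groups and more samples than groups")

    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    # Variation of group means around the grand mean, weighted by group size.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of individual values around their own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    if ss_within == 0:
        raise ValueError("F is undefined when there is no within-group variation")

    df_between, df_within = k - 1, n - k
    f = (ss_between / df_between) / (ss_within / df_within)
    return f, df_between, df_within
```

With four groups of sizes 5, 9, 3, and 3, as in this study, the degrees of freedom would be 3 and 16. The small group sizes limit the power of the test, which is consistent with the caution expressed above.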

Figure 9. Mean bias across varying Fitzpatrick skin types. The differences between Fitzpatrick skin types are not statistically significant (p > 0.05). One-way analysis of variance (ANOVA) was used for statistical analysis.

Discussion
We unified previous evaluations of central-pulse oximetry and addressed relevant concerns while showing an accuracy that is comparable to the state-of-the-art [13]. To the best of our knowledge, this is the first thorough evaluation of chest-based pulse oximetry that jointly features a sufficient sample size, a wide dynamic range of SpO 2 , minimal respiratory artifacts, and rigorous cross validation to avoid data leakage. Furthermore, the study protocol, our alignment method, and key algorithmic components were described in full detail to allow for replication. This work paves the way for realizing the simultaneous monitoring of, in addition to SpO 2 , the cardiac, pulmonary, and cardiopulmonary functions using a small, standalone wearable patch device continuously and remotely, unlocking opportunities in personalized health intervention outside of a clinical setting.

Accurate SpO 2 Estimation
We achieved low mean RMSEs for all training and calibration schemes, which were well within the criteria (RMSE ≤ 3.5%) for reflectance pulse oximetry outlined by the FDA standard [16]. Our work addresses the challenges of the aforementioned approaches and estimates SpO 2 accurately, using a novel algorithm that proves to be robust for PPG measured at this poorly perfused site. The breath-hold protocol successfully induced hypoxemia and reduced respiratory artifacts. Furthermore, the novel algorithms described in Section 2.4.2 to derive R leverage the morphological similarity between PPG red and PPG IR. Our method avoids peak and valley extraction for distorted PPG beats, and proves to be less susceptible to artifacts. The PPG green-based outlier rejection algorithm was inspired by the robustness of PPG green against motion artifacts [31]. Together, they alleviated difficulty in feature extraction for most PPG beats and excluded undesired PPG beats robustly.

Standardization of Subject-Specific Calibration
Besides accurate SpO 2 measurements, we also designed experiments to identify the best training and calibration schemes for this dataset for improving RMSE. More data points help to better calibrate the model to the test subject, and they do so by reducing the noise in the R extracted rather than capturing a wider SpO 2 dynamic range, as evident from the results in Figure 6. Subject-specific calibration of b helps to reduce the randomness in the data. In contrast, if we were to calibrate m alone, we expect SpO 2 range to have a more important role. Ultimately, we can benefit from both longer calibration duration and wider SpO 2 range, but in a situation where hypoxemia is not preferred by the intended users, we have shown that limited SpO 2 dynamic range with a sufficiently long calibration duration can still efficiently improve accuracy. This finding is beneficial for the usage of the device since it alleviates the need to induce changes in SpO 2 and makes the subject-specific calibration procedure more practical and safer. Consequently, if standardization is necessary, breath-hold duration should be standardized instead of the SpO 2 range.
When observing the data used to calibrate both m and b for test subjects, we noticed that the first breath-hold is consistently shorter across subjects. As a direct consequence, data of the first breath-hold generally does not have a wide SpO 2 dynamic range. Furthermore, subjects seemed to be able to hold their breath longer in later breath-holds due to their adaptive tolerance to withstand the vaguely defined "discomfort" [37]. Since estimating the slope m requires a sufficient dynamic range of SpO 2, calibrating m using just the first breath-hold is usually not enough. Using all 10 breath-holds and adequate SpO 2 dynamic range across all training subjects offers a clear advantage to the globally trained m over the calibrated m. Hence, when the calibration data were limited (within the duration of one breath-hold), training m globally and calibrating b using the test subject's data can achieve better accuracy.

Practical Use Case
The current manufacturing cost of the device is on the order of $200. However, producing this device on a large scale can reduce the cost substantially, as components would be ordered in volume and manufacturing processes can be refined to improve manufacturability. We do not foresee any challenges with scalability, as the devices manufactured so far show robust functionality. A practical subject-specific calibration procedure can be designed by aggregating the conclusions made thus far. We suggest a 15 s breath-hold during which the ground truth SpO 2 and biosensor data are collected from a target subject. This breath-hold duration is selected because all breath-hold durations across the subjects in this study exceed 15 s. Using the data from the subjects analyzed in this study, we found the globally trained slope m_global to be −21.54 and the intercept b_global to be 106.69. Following the semi-globalized scheme, b_global can be replaced by b_subject-specific, which is calibrated using data from the 15 s breath-hold of the target subject. The resulting subject-specific linear model takes the form SpO 2 = m_global × R + b_subject-specific. One potential reason for calibration failure could be smoking. According to [38], smokers have elevated levels of carboxyhemoglobin (COHb). As a result, the assumption made in Equation (2) (i.e., that the only hemoglobin species in the arteries are Hb and HbO 2) is violated, which can subsequently lead to the overestimation of SpO 2.
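The calibration procedure suggested above can be sketched as follows. The slope and intercept constants are the values reported in the text; the function names are hypothetical, and computing the intercept as the mean of SpO 2 − m × R (the least-squares intercept when the slope is held fixed) is an assumption about how b would be calibrated.

```python
M_GLOBAL = -21.54   # globally trained slope reported in the text
B_GLOBAL = 106.69   # globally trained intercept reported in the text

def calibrate_intercept(r_values, spo2_reference, m=M_GLOBAL):
    """Return the subject-specific intercept b for a fixed slope m.

    Assumption: b is the least-squares intercept with m held fixed, which
    reduces to the mean of (SpO2_reference - m * R) over the calibration data.
    """
    if len(r_values) != len(spo2_reference) or not r_values:
        raise ValueError("need equal-length, non-empty calibration data")
    offsets = [s - m * r for r, s in zip(r_values, spo2_reference)]
    return sum(offsets) / len(offsets)

def estimate_spo2(r, b, m=M_GLOBAL):
    """Apply the linear model SpO2 = m * R + b."""
    return m * r + b

# Hypothetical 15 s calibration data from one target subject.
r_cal = [0.40, 0.42, 0.45, 0.50]
spo2_cal = [98.0, 97.5, 97.0, 96.0]
b_subject = calibrate_intercept(r_cal, spo2_cal)
spo2_now = estimate_spo2(0.48, b_subject)
```

Averaging over all samples in the breath-hold is what allows a longer calibration duration to reduce the noise in b, even when the SpO 2 range covered is narrow.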

Limitations
One key limitation is that we only validated the accuracy of our pulse oximetry during segments with minimal respiratory artifacts. Kramer et al. [13] previously reported high accuracy from subjects undergoing spontaneous breathing despite a lack of details of their algorithm. Future work should investigate whether our proposed AC red/IR extraction and PPG green outlier rejection algorithm can withstand respiratory artifacts more severe than the involuntary respiratory movements during breath-hold and attain similar accuracy. In addition, we noticed that the deoxygenation dynamics of R still may not be perfectly aligned to those in SpO 2, even after precise alignment. Our alignment method assumed that the delay between the start of breath-hold and chest SaO 2 deoxygenation is distributed across the integer values in the interval [−10, 10], as described in Appendix A. However, the delay may fall outside this interval, and therefore it is likely that we still captured an undesired delay for some subjects. Specifically, inter-subject and intra-subject variability in this delay may directly translate to variability in the estimated b, which explains the importance of calibrating b. For example, consider the case where the error in alignment is 1 s in a breath-hold of a subject. The desaturation rate, of around 0.26% per second, can be roughly estimated from Figure 3. The 1 s error in alignment can lead to a difference of 0.26% for all datapoints of that breath-hold, and therefore systematically introduces a bias. Besides this bias, the breath-hold study method may also lead to another shortcoming. The SpO 2 at different measurement sites may not map to one another completely, even if the delay has been accounted for [5]. This is understandable because the sternum and finger have different tissue and vascularization compositions and therefore deliver and consume oxygen at different rates as well.
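The alignment example above can be checked with a small simulation. This is a sketch under simplifying assumptions that are not part of our analysis: desaturation is taken to be perfectly linear at 0.26% per second, R is generated noise-free from the linear model with the globally trained parameters, and the starting SpO 2 of 98% is arbitrary.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = m * x + b; returns (m, b)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    m = sxy / sxx
    return m, mean_y - m * mean_x

RATE = 0.26                      # desaturation rate in % per second (from the text)
M_TRUE, B_TRUE = -21.54, 106.69  # globally trained parameters (from the text)

def spo2_at(t):
    """Assumed linear desaturation starting at 98% (hypothetical)."""
    return 98.0 - RATE * t

times = range(0, 21)
r = [(spo2_at(t) - B_TRUE) / M_TRUE for t in times]

aligned = [spo2_at(t) for t in times]      # reference correctly aligned
late = [spo2_at(t + 1) for t in times]     # reference misaligned by 1 s

m_aligned, b_aligned = fit_line(r, aligned)
m_late, b_late = fit_line(r, late)
intercept_shift = b_late - b_aligned       # slope is unchanged; only b moves
```

Under these assumptions, the 1 s misalignment leaves the fitted slope unchanged and moves the intercept by the desaturation rate times the alignment error, which is the mechanism by which delay variability is absorbed into b.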
Finally, similar to most commercial finger pulse oximeters, the chest-based pulse oximetry may suffer from motion artifacts, the presence of HbCO and HbMET, venous pulsation [39], etc. However, we expect the RMSE to improve considerably if deoxygenation is induced slowly and the estimates are compared using the validation protocol suggested by the FDA [16], which would show improved accuracy of the biosensor's measurements despite the above limitations.

Future Work
The results and techniques demonstrated in this work allow for the accurate measurement of SpO 2, which can in turn be used to better characterize underlying pulmonary dysfunctions unobtrusively, continuously, and remotely. Together with its ability to measure cardiac function, we can next validate the wearable patch biosensor for its ability to quantitatively and objectively assess disease progression of cardiovascular and pulmonary diseases such as COVID-19, nocturnal hypoxia caused by sleep apnea, and high-altitude sickness. Ultimately, tracking these health parameters may provide a better understanding of the cardiopulmonary-related comorbidities and consequently facilitate the adoption of longitudinal wearable monitoring devices for detecting underlying disease when symptoms are subtle and unnoticeable.

Conclusions
Here, we demonstrated that our custom, chest-worn wearable patch biosensor was capable of accurately estimating SpO 2 while subjects underwent a protocol of 10 breath-holds. We showed that standardizing the calibration duration, rather than the calibration range, was the most important factor for optimal calibration. Finally, we found that differences in Fitzpatrick skin types do not introduce disparities in bias. Future studies will focus on improving the study protocol to induce gradual changes in SaO 2 as per the FDA guidelines while measuring gold standard SaO 2 simultaneously through co-oximetry of arterial blood samples [16], designing algorithms that mitigate respiratory artifacts when present, and recruiting a larger population that is demographically diverse, especially participants with higher Fitzpatrick skin types. Together with its holistic cardiac monitoring, this device can provide longitudinal and quantitative information of disease progression in both cardiovascular and pulmonary diseases.

Informed Consent Statement:
Informed consent was obtained from all subjects involved in the study.

Data Availability Statement:
The data presented in this study are available on request from the corresponding author. The data are not publicly available due to them containing information that may compromise the privacy of the research subjects.

Conflicts of Interest:
O.T.I. is a Scientific Advisor to Physiowave, Inc., and Co-founder and Chief Scientific Advisor for Cardiosense, Inc. M.E. is also the Co-founder and Chief Technical Advisor for Cardiosense, Inc. J.A.H. is the Co-founder and Hardware Lead for Cardiosense, Inc. The funders had no role in the design of the study; in the collection, analyses or interpretation of the data; in the writing of the manuscript; or in the decision to publish the results.