Hearables: In-Ear Multimodal Data Fusion for Robust Heart Rate Estimation

Background: Ambulatory heart rate (HR) monitors that acquire electrocardiogram (ECG) and/or photoplethysmogram (PPG) signals from the torso, wrists, or ears are notably less accurate in tasks associated with high levels of movement compared to clinical measurements. However, a reliable estimation of HR can be obtained through data fusion from different sensors. These methods are especially suitable for multimodal hearable devices, where heart rate can be tracked from different modalities, including electrical ECG, optical PPG, and sounds (heart tones). Combined information from different modalities can compensate for single-source limitations. Methods: In this paper, we evaluate the possible application of data fusion methods in hearables. We assess data fusion for heart rate estimation from simultaneous in-ear ECG and in-ear PPG, recorded on ten subjects while performing 5-minute sitting and walking tasks. Results: Our findings show that data fusion methods provide a similar level of mean absolute error as the best single-source heart rate estimation but with much lower intra-subject variability, especially during walking activities. Conclusion: We conclude that data fusion methods provide more robust HR estimation than a single cardiovascular signal. These methods can enhance the performance of wearable devices, especially multimodal hearables, in heart rate tracking during physical activity.


Introduction
Ambulatory wearable heart rate trackers provide physiological measurements during dynamic everyday real-world activities. However, they have been characterized as being less accurate in tasks associated with high levels of movement compared to data acquired in a clinic. The accuracy of wearable devices is related to their placement; for example, a device on the wrist is more likely to pick up movement artefacts than one on the chest [1]. Nonetheless, even when placed on the chest, the relative error of a heart rate monitor increases with the level of exercise intensity [2].
In addition, the performance of automatic heartbeat detection algorithms depends on the signal-to-noise ratio [3]; for example, for signal-to-noise ratios below 5 dB, R peak detection in the electrocardiogram (ECG) is considered unreliable [4]. Generally, a detector's sensitivity and positive predictive value decrease for ambulatory data compared to the standard system in a clinic [5]. Even when poor-quality data or data corrupted by motion artefacts are excluded from analysis, the accuracy of detectors applied to ambulatory data is typically worse than that for data acquired in clinics [6]. Georgiou et al. [7] pointed out that so far, wearable devices can only be used as a surrogate for heart rate variability at rest or under mild exercise conditions, as their accuracy degrades with increasing exercise load.
Hearables are a very convenient wearable modality, owing to the privileged position of the head and ear canal on the human body and the fixed distance to vital organs. However, the in-ear ECG [8] signal, measured between electrodes placed inside the ear canal, has a smaller amplitude than the standard ECG acquired from the torso and a lower signal-to-noise ratio, making it difficult to automatically detect R peaks with standard algorithms [9]. On the other hand, earpieces provide a good fit and benefit from the collocated position of multiple sensors (electrodes, accelerometer, microphone, and photoplethysmography (PPG) sensor) on an earplug [10].
The multimodality of hearables has already been employed in mental stress detection [11], showing that classification performance can be improved by utilizing heart rate variability features extracted from ear-ECG, with breathing and oxygen saturation features extracted from ear-PPG signals. Multimodality also has a crucial role in artefact removal [12], where signals from microphones and an accelerometer were used to model artefacts and remove them from ear-EEG recordings. Regarding the estimation of heart rate, it can be monitored using various in-ear signal modalities [13]:
• Electrical (ECG) [11];
• Optical (PPG) [14,15];
• Sounds (heart tones) [16].
The multimodality of hearables provides an opportunity for more robust HR estimation by combining heart rate data from various sources using data fusion methods.
Data fusion techniques have the potential to improve estimation accuracy through sensor redundancy (e.g., multiple PPG signals), or by estimating HR from different sensor modalities (PPG and ECG). For example, data fusion can be achieved with a weighted average, where weights are automatically adjusted based on signal quality indices (SQIs). This approach ensures that data from high-SQI signals, i.e., those likely to be more accurate, are used for HR estimation.
In this paper, we evaluate two methods of data fusion, the method described by Li et al. [17] and the method proposed by Rankawat and Dubey [18], for heart rate estimation from simultaneous in-ear ECG and in-ear PPG, recorded on ten subjects while performing 5-minute sitting and walking tasks.

Rankawat and Dubey's [18] Method
Rankawat and Dubey [18] proposed a data fusion method whereby HR is estimated from n different sources and fused as a weighted average:

$$HR = \frac{\sum_{i=1}^{n} w_i \, HR_i}{\sum_{i=1}^{n} w_i}$$

The weight $w_i$ for the i-th source is estimated based on the signal quality index (SQI) and the source type. Rankawat and Dubey [18] identified two categories of signals: cardiovascular, directly related to heart activity, such as ECG, PPG, and arterial blood pressure (ABP); and non-cardiovascular signals where the heart-related component is an artefact, such as electroencephalogram, electrooculogram, and electromyogram. For the latter, the Teager-Kaiser energy operator [19] was used to perform beat detection.
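The weighted-average fusion can be sketched in a few lines of Python. This is a minimal illustration and not the implementation of Rankawat and Dubey [18]: the function names are hypothetical, and the SQI-to-weight rule (reject a source below an SQI of 0.7, otherwise use the SQI itself as the weight) is an assumption standing in for the weights listed in Table 1.

```python
from typing import Optional, Sequence


def sqi_to_weight(sqi: float, threshold: float = 0.7) -> float:
    """Hypothetical weight rule: zero weight below the threshold, else the SQI."""
    return sqi if sqi >= threshold else 0.0


def fuse_weighted_average(
    hr: Sequence[float], weights: Sequence[float]
) -> Optional[float]:
    """Fuse per-source HR estimates as a weighted average.

    Returns None when every source has zero weight, i.e. when no
    valid source is available and no estimate should be reported.
    """
    if len(hr) != len(weights):
        raise ValueError("hr and weights must have the same length")
    total = sum(weights)
    if total <= 0.0:
        return None
    return sum(w * h for w, h in zip(weights, hr)) / total


# Example: three sources, the second one has poor quality and is rejected.
hr_estimates = [62.0, 120.0, 64.0]
sqis = [0.9, 0.3, 0.8]
fused = fuse_weighted_average(hr_estimates, [sqi_to_weight(s) for s in sqis])
```

With these inputs the low-quality 120 bpm estimate does not contribute, and the fused value lies between the two remaining estimates.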
The weights for signals were taken according to Table 1. Rankawat and Dubey [18] assessed SQI as a product of a "beat rhythm factor" and a "beat deviation factor", calculated using eight previous beats. The beat rhythm factor ($C_i$) was calculated as 1 minus the ratio of the standard deviation ($\sigma_i$) and the median of RR intervals ($\mu_i$) as

$$C_i = 1 - \frac{\sigma_i}{\mu_i}$$

The beat deviation factor ($D_i$) was assessed based on the deviation of the current RR interval ($x_i$) from a mean value ($\mu_i$).

In the method described by Li et al. [17], the weight $\sigma$ associated with each signal is calculated from the heart rate $HR_k$, the Kalman filter state prediction $X_k^-$, and the signal quality index $SQI_k$ for the current beat.
The goal of the Kalman filter is to produce evolving optimal estimates of a modelled process from noisy measurements of the process. The Kalman filter is a set of mathematical equations that provide a computationally efficient (recursive) way to estimate the state of a process by minimizing the mean squared error of the estimates [20].
In the context of HR tracking, the state process of the filter can be modelled as a random walk, where a single state value (HR) changes by a random amount in each step. Then, the previous HR value can be used to estimate HR at the next step. Therefore, in the initial step of the recursive filtering algorithm, the next state value ($X_{k+1}^-$) is predicted as

$$X_{k+1}^- = X_k^+$$

Then, the system state error covariance of the next step ($P_{k+1}^-$) is calculated as

$$P_{k+1}^- = P_k^+ + Q$$

where Q is the process covariance, which is set during filter design. Li et al. [17] empirically found Q = 0.1 as the optimal value. For higher Q, the filter starts to follow the HR observation too closely, while for lower values, the filter firmly trusts its own estimation and does not adapt to observations. The initial $P_0^+$ is defined during the design of the filter. Next, the Kalman gain ($K_k$) is calculated as

$$K_k = \frac{P_k^-}{P_k^- + R_k}$$

where $R_k$ is the measurement covariance. Li et al. [17] adjusted $R_k$ based on the SQI, with R as the base covariance value set during filter design. Together, $R_k$ and Q determine the filter behaviour. As mentioned before, lower Q values result in the filter strongly following its own model prediction. Conversely, a lower $R_k$ causes the filter to follow the measured values. An adjustable $R_k$ ensures that the filter will firmly follow its own prediction for low SQIs, while it will prefer the HR estimation for higher SQIs. The last step of the Kalman filter algorithm is the correction of the process state estimation and filter covariance:

$$X_k^+ = X_k^- + K_k \left( Z_k - X_k^- \right)$$

$$P_k^+ = \left( 1 - K_k \right) P_k^-$$

where $Z_k$ is the measured HR estimated from the signal. The corrected estimations $X_k^+$ are the filter output used for data fusion.
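The predict, gain, and correct steps of this scalar random-walk Kalman filter can be sketched as follows. This is an illustrative sketch under stated assumptions, not the implementation of Li et al. [17]: the function names, the base covariance `r_base`, the initial state `x0`, and the initial covariance `p0` are hypothetical, the SQI is assumed to lie in (0, 1], and the exponential rule that inflates the measurement covariance as the SQI falls is an assumed form chosen for illustration. Only Q = 0.1 is taken from the text.

```python
import math
from typing import List, Sequence


def measurement_covariance(r_base: float, sqi: float) -> float:
    """Assumed SQI adjustment: R_k equals r_base at SQI = 1 and grows as SQI falls."""
    if sqi <= 0.0:
        return math.inf  # unusable measurement, gain becomes zero
    exponent = min(1.0 / (sqi * sqi) - 1.0, 50.0)  # capped to avoid overflow
    return r_base * math.exp(exponent)


def kalman_track_hr(
    z: Sequence[float],
    sqi: Sequence[float],
    q: float = 0.1,
    r_base: float = 1.0,
    x0: float = 60.0,
    p0: float = 1.0,
) -> List[float]:
    """Scalar random-walk Kalman filter over HR measurements z with per-beat SQI."""
    x, p = x0, p0
    corrected = []
    for z_k, sqi_k in zip(z, sqi):
        # Prediction: random walk, so the state carries over and covariance grows by Q.
        x_pred = x
        p_pred = p + q
        # Gain: small when the measurement covariance is large (low SQI).
        r_k = measurement_covariance(r_base, sqi_k)
        k_gain = p_pred / (p_pred + r_k)
        # Correction of state and covariance.
        x = x_pred + k_gain * (z_k - x_pred)
        p = (1.0 - k_gain) * p_pred
        corrected.append(x)
    return corrected
```

With a single measurement of 70 bpm and an initial state of 60 bpm, an SQI of 1 moves the estimate roughly halfway towards the measurement, whereas an SQI of 0 leaves the estimate at the prediction.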
As SQIs, Li et al. [17] used values combined from four different SQI estimates: (1) based on the comparison of different QRS detectors on a single-lead ECG; (2) based on the comparison of QRS detection using different leads; (3) based on signal kurtosis, i.e., similarity to a Gaussian distribution; and (4) based on the spectral distribution of the ECG, i.e., the ratio of the power spectral density of the QRS complex (range from 5 to 14 Hz) to the power spectral density between 5 and 50 Hz.

Validation of Methods on In-Ear Measurements
We recorded 10 healthy subjects (4 females and 6 males, aged 24-34) during 5 min of sitting at rest and 5 min of walking on a treadmill at a speed of 4 km/h. We acquired a one-channel cross-head ECG by positioning one viscoelastic foam sensor [10] in each ear, where the biopotential was measured between electrodes placed in both ears, and one standard torso ECG (modified Lead I configuration). Simultaneously, we measured signals from two MAX30101 PPG sensors (Maxim Integrated), placed in the conchae of both ears and mounted on a flexible shell developed in our lab. PPG signals for green, red, and infrared light were recorded.
The study was conducted under the approval of the Imperial College ethics committee (JRCO 20IC6414), and all subjects provided full informed consent.
To prepare the PPG signals for onset detection, we first inverted the polarity of the acquired signals and applied a low-pass filter (cut-off frequency = 12 Hz). Next, we detected the onsets using the qppg function from the PhysioNet Cardiovascular Signal Toolbox [21]. Detection was performed independently for all six PPG signals (red, infrared, and green wavelengths for each of the two sensors).
To detect R-peaks in Ear-ECG, we used a deep matched filter detector (DeepMF) introduced by Davies et al. [22]. The detector consists of an encoder stage (trained as part of an encoder-decoder module to reproduce ground truth ECG), which operates as a matched filter. The encoder section searches for matches with an ECG template pattern in the input signal, prior to refining and filtering the matches with the subsequent convolutional layers and an R-peak classifier stage. This classifier consists of a single-layer 1D convolution, followed by a sigmoid activation function, flattening, and a linear output layer. The method has been shown to provide higher median R-peak recall and precision than standard matched filters [22]. The detector was previously trained using a separate dataset; we did not modify model weights for this study.
For the estimation of SQI in heart signals, the calculate_ppgsqi function from the PhysioNet Cardiovascular Signal Toolbox was used. This function employs multiple template-matching stages, and was described by Li et al. [23]. Firstly, a dynamic beat template was built by averaging beats in a 30 s signal window. The fiducial points were chosen as the detected onsets for the PPG signals and the R peaks for the ECG signals. Each beat starts from the fiducial point and ends at the fiducial point of the next beat. The mean SQI was obtained from four SQI estimation methods. Three SQIs were based on the correlation of the template with: a beat, a linear interpolation of the beat, and the beat after dynamic time warping to match the template. The fourth SQI was the percentage of samples that were saturated (to the maximum or minimum values). We applied this method to all acquired signals: in-ear PPGs and in-ear ECG.
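The first of the template-based SQIs, the correlation of each beat with an averaged template, can be illustrated with a short sketch. This is not the calculate_ppgsqi implementation: the function names are hypothetical, beats are assumed to be already segmented at the fiducial points and resampled to a common length, and clipping negative correlations to zero is an assumption made for illustration.

```python
import math
from typing import List, Sequence


def pearson(a: Sequence[float], b: Sequence[float]) -> float:
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0.0 or var_b == 0.0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)


def template_correlation_sqi(beats: Sequence[Sequence[float]]) -> List[float]:
    """Correlate each beat with the sample-wise average of all beats in the window."""
    template = [sum(column) / len(beats) for column in zip(*beats)]
    return [max(0.0, pearson(beat, template)) for beat in beats]


# Example: three consistent beats and one inverted (artefact-like) beat.
window = [
    [0.0, 1.0, 0.0, -1.0],
    [0.0, 1.0, 0.0, -1.0],
    [0.0, 1.0, 0.0, -1.0],
    [0.0, -1.0, 0.0, 1.0],
]
sqi_per_beat = template_correlation_sqi(window)
```

The three consistent beats correlate perfectly with the template, while the inverted beat receives an SQI of zero.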
We averaged the estimated HR and SQI values from each of the 7 sources and from the data fusion methods over non-overlapping 10 s windows. We calculated the mean absolute errors (MAEs) for each HR estimation method with respect to the HR values obtained from the torso ECG, used as the ground truth.
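The windowing and error computation can be sketched as below. The function names and the handling of missing windows are assumptions made for illustration: windows in which either the estimate or the reference has no value (for example, when a fusion method rejects all sources) are skipped.

```python
from typing import Dict, Optional, Sequence


def window_means(
    times_s: Sequence[float], values: Sequence[float], window_s: float = 10.0
) -> Dict[int, float]:
    """Average values in consecutive non-overlapping windows, keyed by window index."""
    sums: Dict[int, float] = {}
    counts: Dict[int, int] = {}
    for t, v in zip(times_s, values):
        idx = int(t // window_s)
        sums[idx] = sums.get(idx, 0.0) + v
        counts[idx] = counts.get(idx, 0) + 1
    return {idx: sums[idx] / counts[idx] for idx in sums}


def mean_absolute_error(
    estimate: Dict[int, float], reference: Dict[int, float]
) -> Optional[float]:
    """MAE over windows present in both inputs; None if there is no common window."""
    common = sorted(set(estimate) & set(reference))
    if not common:
        return None
    return sum(abs(estimate[i] - reference[i]) for i in common) / len(common)


# Example: beat-wise HR from one source and from the reference, in bpm.
est = window_means([1.0, 5.0, 12.0, 18.0], [60.0, 62.0, 70.0, 74.0])
ref = window_means([2.0, 8.0, 11.0, 19.0], [60.0, 60.0, 70.0, 70.0])
mae = mean_absolute_error(est, ref)
```

In this example the two 10 s windows have absolute errors of 1 bpm and 2 bpm, giving an MAE of 1.5 bpm.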

Results
Figure 1 shows scatter plots illustrating the relationship between the heart rate (HR) estimates obtained from in-ear signals and the HR values derived from data fusion methods, in comparison to the ground truth HR estimation from torso ECG. The highest correlation to the reference for a single source was R = 0.38 for the PPG1 IR signal, while the highest correlation overall, R = 0.60, was obtained when using Rankawat and Dubey's method [18]. Notably, the HR estimates derived from PPG signals tend to be overestimated (frequently above the orange y = x line of perfect correlation). On the other hand, HR estimates obtained from in-ear ECG signals using the DeepMF method are more likely to underestimate the true HR values.
It is important to note that Rankawat and Dubey's method has the capability to reject outliers and may not provide results in cases where the signal does not have an adequate SQI value. For example, in subject 7 the method gave results in only two of thirty segments.
Table 2 summarizes the MAEs obtained during sitting from in-ear source signals and data fusion methods. The corresponding MAE values during the walking activity are summarized in Table 3.
Rankawat and Dubey's method consistently demonstrated the lowest mean MAEs across subjects for both activities, with values of 8.0 bpm during sitting and 15 bpm during walking. Notably, Li et al.'s method outperformed the best single-source HR estimation method, both during sitting (17 bpm compared to 23 bpm for the PPG2 IR signal) and during walking (18 bpm compared to 23 bpm for Green PPG2). Rankawat and Dubey's method had high MAE values in subjects 2, 5, and 10 (Table 2). However, in subjects 2 and 10 the method had the lowest MAE value among all methods. In subject 5, the method followed the results obtained from the DeepMF too closely, while better outcomes were observed when using the Red and IR PPG signals (Figure 1).
During walking, Rankawat and Dubey's method maintained an acceptable MAE (below 5 bpm) in three subjects (1, 4 and 8). Otherwise, acceptably low MAE values were only obtained in subject 1 when using the PPG2 IR and Green signals.
Figure 2 shows the relationship between the MAEs of HR estimation and mean SQI values, with each data point representing values for different subjects. For PPG signals, when the SQI values were high, the MAE was consistently below 20 bpm. With SQI values below 50, the MAE tended to rise. In contrast, for in-ear ECG signals, the relationship between MAE and SQI was not as clear. The MAE was low for subjects 1 and 4, even though their SQI was below 50, and those with a higher SQI like subjects 3 and 5 had larger MAEs.

Discussion
We have shown that data fusion methods have lower MAEs than single-source HR estimations (Table 2). Furthermore, data fusion methods reduced the variation of MAEs and provided more robust HR estimation, especially during the walking activity (Table 3) when signals are affected by motion artefacts.
The central idea of data fusion methods is to select the best available sources for HR estimation. The main advantage of Rankawat and Dubey's method is its ability to reject measurements when there is no valid source (every signal has an SQI lower than 0.7). In this case, the method does not provide an HR estimation. On the other hand, the Li et al. [17] weighting algorithm uses information from all sources, even from low-quality ones. When all of them are poor and have very low SQIs, the resulting HR estimation will include information from all of them and provide an unreasonable estimation.
In this study, for in-ear recordings during sitting, the HR estimated from five subjects (subjects 2, 6, 7, 8 and 10), based on individual signals, had MAEs greater than 5 bpm. In these situations, Rankawat and Dubey's method [18] correctly rejected invalid measurements and kept MAE values at a reasonable level, lower than single-source estimation. On the other hand, Li et al.'s [17] algorithm resulted in an MAE slightly higher than the best single-source method.
For data fusion to work correctly, it is critical to estimate SQIs accurately and to prevent the usage of invalid data when estimating HR. Rahman et al. [25] evaluated the performance of different SQIs on synthetic data (ECG recordings with artificially added noise). They found that the performance of SQIs fluctuated considerably across datasets and concluded that fixed threshold-based SQIs cannot be used as a robust noise detection system. They suggested using adaptive thresholds and machine learning mechanisms to improve signal quality assessment.
In our study, SQI estimation was especially challenging in subject 5 (Figure 1), where SQIs for in-ear ECG were overestimated and led to incorrect estimations of HR provided by Rankawat and Dubey's method. This was also observed for SQI estimation in an in-ear ECG recording (Figure 2). SQIs did not seem to be related to MAE, while in a standard scenario, a higher SQI should lead to a lower MAE, as in the case of PPG signals. The method used in this study for estimating SQI is based on the correlation of an individual beat with a template built by averaging the beats in a 30 s window. This method seems reasonable for PPG signals where repeatable pulsation has a much larger amplitude than noise. However, it does not seem to work properly in the case of in-ear ECG signals where the signal-to-noise ratio is likely to be lower (the noise level is similar to the ECG amplitude).
The in-ear ECG signal requires a more dedicated method for SQI estimation. Improvements in signal quality assessment, for example with deep neural networks [26] or a cascade of classifiers [27], may further improve the performance of data fusion methods.
Notably, the HR estimates derived from PPG and ECG show opposite trends. When employing the DeepMF method, ECG signals tend to underestimate HR (i.e., miss a few peaks), while PPG signals tend to overestimate it (i.e., identify more peaks than the real ones), as depicted in Figure 1. This contrasting behaviour makes this problem well-suited for data fusion methods, where the combination of different estimates compensates for distinct and opposite biases, resulting in a more reliable estimation.
Beat detectors for PPG perform well on high-quality PPG signals, but their performance decreases for noisy or low-amplitude signals such as those recorded in the ear. The reliability of PPG for HR estimation has been questioned recently. Weiler et al. [28] compared averaged HR readings from PPG and ECG signals, and they did not find a statistically significant difference, but when the HR reached a value around 155-160 bpm, a difference of ±5 bpm was observed. Charlton et al. [29] evaluated eight different beat detectors for PPG and found that detectors performed well on hospital data and at rest, but performed worse during movement, stress, atrial fibrillation, and in neonates. In that study, the detectors denoted MSPTD [30] and qppg [21] (used in this study) performed best, with complementary performance characteristics. MSPTD looks for peaks in PPG signals without using a priori knowledge of the characteristics of the signal, while qppg searches for systolic up-slopes based on their expected characteristics.
The performance of detectors may be improved; for example, Galli et al. [31] proposed an algorithm that combines three sequential signal processing stages: signal denoising by joint principal component analysis of PPG and accelerometer signals, Fourier-based heart rate measurement, and smoothing of the HR estimation via Kalman filtering. Galli et al. [31] showed that the average deviation from reference values was 1.66 bpm during running and 2.92 bpm during boxing activity. The development of a dedicated onset detector for in-ear PPG signals, such as DeepMF for in-ear ECG, is an interesting route for further study.
Moreover, PPG quality is affected by skin colour, which interferes with the reflection of the light used for measurement and disturbs optical measurements. Racial bias in blood oxygen saturation measurement using PPG has been observed [32]. Some measurement sites can have a thinner epidermis compared with the finger and lower exposure to sunlight, and may be less prone to the influence of melanin and pigmentation [33]. Hartman et al. [34] discussed that PPG signals acquired from different locations vary in amplitude and shape, and in some cases may be unsuitable for analysis. In Hartman et al.'s [34] study, 95% of recordings from the finger were suitable for analysis, followed by 86% of recordings on the wrist, and 81% on the earlobe.
Data fusion methods seem to be a necessity for PPG in the earlobe location, where the signal amplitudes are smaller and likely to be corrupted by motion artefacts. Data fusion methods and the Kalman filter used by Li et al. [17] provide ways to reject outlier results. Further improvement of data fusion can be made by modifying the weighting equation. Our observations suggest that better results should come from a fusion operation in a winner-take-all fashion. We hypothesize that data fusion should mostly use the best available signal, and weights should be associated with the best signal and drop rapidly with a relative drop in SQIs.
Data fusion methods provide more robust HR estimation than a single cardiovascular signal. In particular, data fusion methods are useful for data recorded during movement, where signals may be affected by motion artefacts. Through the integration of multimodal signals available from the in-ear location, data fusion methods can enhance the performance of a wearable device in HR tracking.
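One way to express such a weighting is an exponential decay in the relative SQI drop from the best source. The sketch below is purely illustrative of the hypothesis: the function name, the exponential form, and the sharpness parameter `beta` are assumptions, and this is not a method that was evaluated in this study.

```python
import math
from typing import List, Sequence


def winner_take_all_weights(sqis: Sequence[float], beta: float = 10.0) -> List[float]:
    """Normalized weights that decay rapidly with the relative SQI drop from the best source.

    beta is a hypothetical sharpness parameter: larger values push the
    weighting closer to selecting only the best source.
    """
    if not sqis:
        return []
    best = max(sqis)
    if best <= 0.0:
        # No usable source: assign zero weight everywhere.
        return [0.0] * len(sqis)
    raw = [math.exp(-beta * (best - s) / best) for s in sqis]
    total = sum(raw)
    return [r / total for r in raw]


# Example: the second source has half the SQI of the best one.
weights = winner_take_all_weights([0.9, 0.45])
```

With `beta = 10`, a source whose SQI is half that of the best source receives less than 1% of the total weight.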
Future work will include the following:
1. The enhancement of data fusion methods by refining the assessment of weights.
2. The development of a PPG beat detector optimized for low-amplitude in-ear PPG signals.
3. The improvement of SQI estimation methods towards more reliable HR estimation.

Figure 1. Scatter plots of HR estimated during sitting, from in-ear source signals and with data fusion methods, relative to the ground truth HR estimated from torso ECG. Different colors are associated with different subjects.

Figure 2. Relationship between the MAE value and SQI for different sources.

Table 1. Values of weights assigned by Rankawat and Dubey [18] based on SQI and source type.

Table 2. MAE values for single-source HR estimation and data fusion methods during the sitting activity. The smallest value in each row is designated in bold.

Table 3. MAE values for single-source HR estimation and data fusion methods during the walking activity. The smallest value in each row is designated in bold.