Development of a Personalized Multiclass Classification Model to Detect Blood Pressure Variations Associated with Physical or Cognitive Workload

Comprehending the regulatory mechanisms influencing blood pressure control is pivotal for continuous monitoring of this parameter. Implementing a personalized machine learning model, utilizing data-driven features, presents an opportunity to facilitate tracking blood pressure fluctuations in various conditions. In this work, data-driven photoplethysmograph features extracted from the brachial and digital arteries of 28 healthy subjects were used to feed a random forest classifier in an attempt to develop a system capable of tracking blood pressure. We evaluated the behavior of this latter classifier according to the different sizes of the training set and degrees of personalization used. Aggregated accuracy, precision, recall, and F1-score were equal to 95.1%, 95.2%, 95%, and 95.4% when 30% of a target subject’s pulse waveforms were combined with five randomly selected source subjects available in the dataset. Experimental findings illustrated that incorporating a pre-training stage with data from different subjects made it viable to discern morphological distinctions in beat-to-beat pulse waveforms under conditions of cognitive or physical workload.


Introduction
Population growth and the aging demographic are recognized as predominant factors linked to the rise in the incidence of cardiovascular diseases (CVDs) [1].The estimated increment in the number of adults aged 30 to 79 ranged from approximately 650 million to 1.28 billion between 1990 and 2019 [2,3].Hypertension, also known as elevated blood pressure (BP), is a medical condition that promotes the insurgence of coronary diseases and different pathologies impacting vital organs, such as the brain and kidneys [4][5][6][7].
Noninvasive BP measurement systems have emerged to overcome the mentioned limitation of the invasive approach [8].These devices employ multi-modal sensors that exploit diverse physical principles to extract information regarding the status of the cardiovascular system [9][10][11][12][13].Continuous arterial BP measurement has been widely acknowledged as a more accurate determinant of cardiovascular risks since alterations in systolic or diastolic BP, along with changes in the shape of BP waveforms over time, reflect the progression of arterial and arteriolar modifications [14,15].Moreover, by analyzing the arterial pressure waveforms, the cardiovascular status can be assessed through the estimation of physiological parameters [16,17].A promising approach for continuous BP measurement is through computational modeling of the circulatory system [18][19][20].These models integrate noninvasive data, like aortic flow and peripheral readings, to estimate BP.Parallelly, calibrated methods for cuffless BP, including Pulse Transit Time (PTT) and pulse wave analysis (PWA), stand out as viable solutions for BP assessment [21].PTT is the time delay for the pressure wave to travel between proximal and distal arterial sites [22].As defined in different arterial stiffness studies, PTT is conventionally identified as the foot-to-foot time delay between proximal and distal arterial BP waveforms [23,24].PWA relies on extracting features from an arterial waveform and associating them with BP units through a calibration model.This approach offers greater convenience compared to the PTT method, as it necessitates only a single sensor [25] or can be employed in conjunction with PTT to enhance its accuracy.PWA employs data-driven feature extraction to extrapolate relevant information for the BP assessment.Numerous features have been investigated to analyze the morphology of the arterial pulse waves captured by photoplethysmography (PPG) sensors [26][27][28].
Several machine learning (ML) algorithms, including Support Vector Machines (SVM), random forests (RF), and feedforward neural networks (NN), have been employed in BP assessment [29][30][31].In many instances, nonlinear models have demonstrated superior performance compared to linear models, although this outcome is contingent on the specific dataset and approach utilized, i.e., PTT, Pulse Arrival Time (PAT), or PWA using only PPG data.Additionally, more advanced methods, such as Recurrent Neural Networks (RNN) [32] and Long Short-Term Memory (LSTM) networks [33], have also been proposed.Although these models may offer a significant advantage over previously mentioned models by incorporating the ability to capture variations in extracted features over time, Deep Learning (DL) models require a large number of data samples to provide reasonably accurate BP values.Although DL and ML have found extensive application in BP assessment, the considerable inter-subject variability has posed challenges in formulating a sufficiently generalized model whose performance could also be maintained outside of the initial dataset.Therefore, drawing inspiration from established practices in the field of human activity recognition [34,35], numerous studies have suggested the formulation of person-specific models for the examination of this clinical parameter [36][37][38][39].
This paper presents a personalized ML model designed to detect BP changes in response to various stimuli.For this specific application, we developed a customized acquisition system to perform real-time acquisition and visualization of raw PPG pulse waveforms at the level of brachial (elbow) [10] and digital (thumb) [40] arteries.A specific data collection protocol was deployed to perform an accurate assessment and to induce a BP variation owing to the execution of both physical and cognitive tasks [41,42].
Our contributions in this application are as follows: (1) we propose ten pre-training strategies to calibrate an RF model based on individual physiological characteristics; (2) pretraining improves blood pressure classification accuracy on beat-to-beat pulse waveforms by up to 60% compared to a generalized model; (3) overfitting is mitigated by expanding the number of source subjects, reducing the need for additional target subject data; (4) we cut the data required to personalize the model by up to 30% while maintaining evaluation metrics above 95%.The structure of this article is as follows.Section 2 guides the reader through a detailed description of the hardware design of the device developed to retrieve the PPG raw data used in this work.Then, the data capture protocol and the data processing pipeline are presented, along with the description of the training strategies and the evaluation metrics employed to quantify the performance of the ML model.In Section 3, the results of the processing stage are reported, followed by the results retrieved for each subject in the dataset.Section 4 details the discussion and limitations of the proposed analysis.Finally, Section 5 concludes the paper.

Hardware Device
In this work, a custom-designed data acquisition system was employed to collect the arterial pulse waveforms between the brachial artery and the digital artery [43].Figure 1 illustrates the fabricated supports and their positioning on the subject during data collection.The thumb-mounted holder (Figure 1, center) mimics a standard pulse oximeter design, featuring an elastic spring for sensor adherence to the finger.On the top left side of Figure 1, the sensor holder is seen to be positioned on the elbow, with mechanical fixtures that allow the operator to regulate arm pressure, making it steady during the data acquisition process as recommended in [21,44,45].Each enclosure was designed to guarantee that the sensor maintains firm contact with the sample site, applying steady pressure and avoiding the need for the operator to hold it in its position.This feature enhances the replicability of the acquisition setup for any specific subject, thereby improving measurement consistency.Further information on the developed hardware device is available in [43].

Data Collection Protocol
A pre-clinical trial was carried out at the Tyndall National Institute in University College Cork (UCC), Cork, Ireland, to study the blood pressure variations related to the execution of cognitive and physical tasks.In this experiment, approved by the UCC Clinical Research Ethics Committee of the Cork Teaching Hospitals, a cohort of 31 healthy volunteers ranging from 21 to 34 years was recruited.Table 1 details the physiological parameters of the people involved.In accordance with the guidelines established for accurate BP measurement [46], every participant included in this study did not have any pre-existing cardiovascular condition and was not undergoing treatment with medications that could influence BP readings.Moreover, every individual was asked to refrain from smoking or consuming coffee in the 60 min before the session.The first step in the data capture consisted of the individual reclining in a supine position for 10 min to ensure that their hemodynamic conditions and vasomotor tone returned to a baseline level.Subsequently, the subject received instructions regarding the prescribed posture for data capture, which included sitting with back support, both feet flat on the floor, and hands resting on the table at a height equivalent to that of the heart.In accordance with the study protocol reported in Figure 2, after obtaining the anamnesis information, the operator identified the best location for the brachial artery through tactile arterial palpation.Once located, the spot was marked with ink to be sure that the acquisition site did not change over the duration of the data capture.Each data collection session was divided into three principal sections, denoted as follows: the resting phase (REST), the phase dedicated to cognitive testing (CT), and the after-exercise phase (AE).During each phase, a series of three data acquisitions, each one lasting one minute, was performed using the presented device.Then, the commercial cuff-based device BPM Connect [47] (Withings, Issy-les-Moulineaux, France) was used as a gold standard to measure the reference BP values for each specific section.In total, a set of three reference measurements was collected throughout the entire data collection.To prevent any potential recovery effects between measurements using both devices, we conducted the reference assessment immediately after completing the three acquisitions.Each phase was designed to induce changes in both blood pressure and PPG data collected from each participant.The resulting alterations in the PPG pulse waveforms are illustrated in the bottom section of Figure 2. In the CT section, the subject was cognitively stimulated through two cognitive tests: the Stroop test [48] and the n-back test [49,50].Both tests were executed through a customdesigned graphical user interface (GUI) structured to make a gradual augmentation in the level of complexity.Prior to commencing the actual measurement, the operator provided the participant with detailed instructions regarding the tests to be undertaken.Additionally, the participants had the opportunity to familiarize themselves with the GUI through the execution of a short demonstration.Then, the device was positioned on the subject.The last three minutes of this section were recorded during the execution of the n-back test and later subdivided into the three acquisitions related to the CT part.Hence, the reference BP measurement was taken again with the Withings device.Finally, the AE section of the data capture was performed.During this stage, each subject was engaged in a 10-min walking session on a calibrated treadmill.The treadmill's configuration was kept uniform across all data collection sessions.The speed was configured at 8 km/h, and the inclination was adjusted to its maximum level to induce BP variation even in trained subjects.Then, the last three acquisitions with the proposed device were carried out along with the last reference BP measurement.

Data Processing
The data processing pipeline designed for this application can be divided into three major sections: pre-processing, pulse wave quality assessment (PWQA), and lastly, the identification of specific fiducial points employed to derive the features for data analysis.Thumb and elbow PPG measurements were processed following the same procedure within the MATLAB (Natick, MA, USA) environment.Each acquisition was band-pass filtered between 0.3 Hz and 15 Hz [51], to remove the DC offset and the high-frequency noise.Time series segmentation and labeling remain challenging areas of study, with researchers exploring various methods to improve accuracy and efficiency [52,53].
This study addressed this challenge by segmenting our collected data into 3-s, consecutive, non-overlapping windows.This initial segmentation was used to identify and remove portions of the signal possibly corrupted by the presence of motion artifacts [54].Subsequently, every single pulse wave within the acquisition was identified through the localization of the pulse onset, known in the literature as the beginning of the systolic phase within the cardiac cycle.The template matching approach was selected to perform the PWQA [55].As the first step, a reference template was computed from all the available epochs.Then, Pearson's correlation coefficient was used as the signal quality index (SQI) between the ith pulse and the template.All the pulses showing an indicator below the defined acceptance threshold (i.e., 0.95, 0.95, 0.9, respectively, for REST, CT, and AE) were marked as unacceptable and discarded.Feature measurements were obtained from the PPG pulse wave through the identification of key reference points on the pulse wave and its derivatives, which were then used to compute a variety of characteristics.The identified fiducial points included the systolic peak (sys), dicrotic notch (dic), and diastolic peak (dia) on the pulse wave, as well as the point of maximum upslope on the first derivative (ms).Additionally, the a, b, c, d, e, and f waves on the second derivative were detected [56][57][58][59].These reference points are visually represented in Figure 3 for the baseline radial artery PPG pulse wave.Detailed criteria for identifying these fiducial points and features extracted are provided in Tables 2 and 3, respectively.The greatest maximum of s ′′ between b and e (or, if no maxima, then the first maximum on x ′ after e d The lowest minimum on s ′′ after c and before e (or, if no minima, then coincident with c). e The second maximum of s ′′ after ms and before 0.6 T (unless the c wave is an inflection point, in which case take the first maximum).f The first local minimum of s ′′ after e and before 0.8 T.
Abbreviations: PPG, photoplethysmogram; VPG velocity plethysmography; APG, acceleration plethysmogram; s, original pulse; s ′ , first derivative of the original pulse; s ′′ , second derivative of the original pulse.Amplitude of the dicrotic notch Amplitude of the diastolic peak

Model Training
This study examined the efficacy of personalized against generalized training strategies to identify significant alterations in blood pressure levels.As delineated in section II-B, the data collection protocol was meticulously structured to induce BP variations through the execution of both physical and cognitive tasks.This setup enabled a thorough investigation of BP fluctuations in individuals subjected to diverse stimuli.In this context, a macroscopic variation in BP was defined as the difference between the reference values measured throughout the data collection procedure, regardless of the magnitude.Consequently, the phases of the data capture process (e.g., REST, CT, AE) were used as target labels for the analysis, as they inherently reflected BP changes.
Our investigation compared the outcomes derived from applying ten different personspecific models (PSM) against those obtained by a person-independent model (PIM) when applied to the identical dataset, utilizing an RF classifier.Although features were extracted from signals at both sites, only those from the thumb were utilized.Pulse waves collected from the elbow were used to calculate the PTT, which was then used as a feature.Figure 4 (left branch) shows the workflow for our generalized approach.To optimize the model's performance, we used a Leave-One-Subject-Out strategy across all users in the dataset.The optimization process involved the following parameters: the number of trees in each forest, which ranged from 50 to 100; the maximum depth of each tree in the forest, which ranged from 10 to 50; and the number of features used in the training process, which ranged from 3 to 25.The feature selection process was applied only at the training stages by ranking the first n features according to the mutual information between each feature and the target label.The right branch, highlighted in red, summarizes the ten personalized strategies.The tested PSMs differed in the quantity of data used during the training phase and the fraction of the target subject data employed to personalize the model.Starting from PSM SD , in which we used 50% of the data from the kth subject for training, data from randomly selected individuals were gradually added to the training set.Specifically, the number of source subjects varied across 5, 10, and 15 individuals.Different fractions of the target subject data were also tested to customize the model.This feature was progressively expanded, beginning from an initial value of 15%, and subsequently increased to 30% and 50%.Each combination of these parameters, when applied to the RF, was labeled as PSM i,j where i identifies the number of source individuals, i ∈ 5, 10, 15, and j refers to the percentage of data belonging to the kth target subject j ∈ 15%, 30%, 50%.The right side of Figure 4 details the workflow followed by each PSM i,j before applying the RF model.The initial step involved randomly selecting a portion of data samples from each class.To avoid class imbalance, we made sure that each class was equally represented by selecting 15%, 30%, or 50% of pulse waveforms from each class.Following this, pulse waveforms from different source subjects (5, 10, or 15) were included.
Then, unlike the generalized approach, the PSM method incorporated a 10-fold crossvalidation to fine-tune the model's hyperparameters and identify the most informative subset of features.Finally, all the mentioned solutions applied the RF model to predict the actual class of the input pulses.The output from each subgroup was merged for visualization purposes, although each PSM was tested individually.

Evaluation Metrics
As described in the previous subsection, the ten presented models were trained and tested for each subject in the dataset.Therefore, metrics such as accuracy, precision, recall, and F1-score were computed to evaluate the fluctuations in classification performance from subject to subject.Finally, an average of all indexes was computed along with its standard deviation to summarize the performance of each model.In a multiclass classification problem with three classes (REST, CT, and AE), the definitions are as follows:   The accuracy score, computed as the ratio of correctly predicted instances over of the total number of instances, was used to quantify the correctness of the predicted labels compared to the actual labels Equation (1).

Accuracy =
∑ True Positives for All Classes Total Instances Given these definitions, the evaluation metrics such as precision and recall scores were computed individually for each of the three classes referring to different sections of the data capture (α ∈ REST, CT, AE) as specified in Equations ( 2) and (3).
where FP α and FN α are the overall numbers of false positives and false negatives referred to the target class α ∈ REST, CT, and AE under evaluation.Then, for every tested subject sub i , the macro-averaged value of precision and recall, Equations ( 4) and ( 5), was calculated according to the following: Precision α (4) where N is the number of classes occurring in this study.
Finally, the the macro-averaged F1-score was computed as reported in Equation ( 6):

Data Processing
Table 4 shows the results of the three processing stages described divided per section of the data capture (REST, CT, AE), for a single site.After eliminating epochs corrupted by motion artifacts, the total number of segmented waveforms amounted to 19,274, distributed as follows: 6348 for the resting phase (REST), 6213 during cognitive testing (CT), and 6713 after the physical exercise phase (AE).Data collected from subject #26 were discarded due to corruption of the recording on both sites during the CT section.The variance in the number of detected waves aligned with the execution of the scheduled tasks during the data capture.Specifically, the approximately 400-wave difference between the REST section and the measurement following treadmill walking could be attributed to the observed increases in heart rate (HR) and BP in the measurements conducted with the reference device.Regarding the CT section, although there was an increase in SBP and DBP values (Table 5), the heart rate remained essentially constant compared with the resting value.This phenomenon was reflected in the number of waves detected (6213, CT vs. 6348, REST).As a result of the PWQA, approximately 3.2% of the total pulses were excluded due to their insufficient similarity to the reference template.Due to the low data quality found in the CT section, data from subjects #17 and #29 were discarded from the dataset used for data analysis.Finally, following the validation of the fiducial points, an additional 4% of data points were discarded for a total of 17,886 pulse waves used in the analysis phase collected from 28 out of 31 subjects.

Model Evaluation
Table 6 compares the aggregated BP classification performance between ten different PSMs with the results achieved using a generalized approach.The results were expressed in terms of mean value and related standard deviation using the scoring criteria (accuracy, precision, recall, and F1-score) mentioned in Section 2.5.The evaluation metrics computed for each subject according to the training strategy are reported in Tables A1-A8 in Appendix A. The person-independent model, denoted as PIM, was trained across the complete dataset employing a Leave-One-Subject-Out cross-validation.Subsequently, performance evaluation was conducted by aggregating the outcomes obtained for each individual.No personalization was applied in this case.The low scores retrieved for each metric (0.36, 0.36, 0.31, and 0.37) suggest the requirement for personalization to model the PPG-BP relationship effectively.Figure 6 displays the results gathered using the RF trained using 50% of the data of the target subject under investigation (PSM SD ).In this case, subjects #2, #14, #21, #25, and #28 displayed a marked decrease in classification performance, showcasing accuracy values of 0.63, 0.67, 0.75, 0.78, and 0.67, alongside precision values of 0.43, 0.5, 0.8, 0.8, and 0.5.Through a systematic assessment using the proposed training approaches, we examined how an alteration in the number of subjects and the percentage of data used for model customization impacted the classification performance.The evaluation metrics computed for each personalized model (PSM i,j ) are reported in Table 6, and the average accuracy score is represented in Figure 7.
In the latter figure, two discernible trends can be identified.Specifically, the average accuracy is directly correlated with the increase in the percentage of data utilized during the pre-training phase and inversely correlated with an augmentation in the number of subjects.The observed accuracy values of 96.4%, 95.7%, and 94.5% in the first set (e.g., PSM 5,50% PSM 10,50% PSM 15,50% ) declined to 95.1%, 92.6%, and 91.6% in the second set (e.g., PSM 5,30% , PSM 10,30% , PSM 15,30% ), and further decreased to 91.4%, 87.4%, and 86% in the third set (e.g., PSM 5,15% , PSM 10,15% , PSM 15,15% ).We excluded combinations that demonstrated overfitting across multiple subjects by discarding those with accuracy and F1 values below 0.95.Therefore, PSM 5,30% , PSM 5,50% , and PSM 10,50% were selected.Given the comparable overall performances across subjects for these combinations, our choice for the best combination was guided by a balance between performance metrics and the minimized data requirement for model customization.This led us to favor PSM 5,30% .

Discussion
This study compared the performance of person-dependent and generalized models adopted to track BP macro-variations associated with physical or cognitive workload using a random forest classifier.This model was chosen due to its ability to handle the nonlinear relationships that exist between the extracted features and the variation in BP [57].In other studies, RF outperformed other nonlinear models such as SVM adopting a nonlinear kernel and neural networks [60].Moreover, RF is less prone to overfitting compared to the other two mentioned MLAs [29].Generalized solutions often struggle with the high inter-subject variability within the dataset, making it challenging to develop a universally applicable model.The choice between personalized and universal models depends on the specific context and objectives of the problem being addressed.
Personalized models, finely tuned to individual users' characteristics, take into account factors like age, gender, medical history, and lifestyle to provide more accurate and relevant predictions of BP.This tailored precision proves particularly crucial for individuals affected by complex health conditions or unique risk factors.Despite these advantages, the construction and maintenance of personalized models for each user pose challenges.This process can be resource-intensive, especially in the field of large-scale applications involving a significant number of subjects.Moreover, privacy and data protection concerns come to the forefront, as the development of personalized models often necessitates access to sensitive user data.
Generalized models, in contrast, are crafted to exhibit proficiency across a diverse spectrum of users without the need for individual customization.This inherent versatility makes them more scalable and simpler to implement, eliminating the necessity for tailoring to each user's unique attributes.The cost-effectiveness and ease of maintenance associated with generalized models make them particularly advantageous for applications boasting a large user base.However, this broad approach comes with a trade-off since generalized models may fail to capture the distinctive characteristics and preferences of individual users.Consequently, the predictions generated by these models may result in a lack of accuracy compared to their personalized counterparts.
This phenomenon, highlighted in [61], is also reflected in our findings where the averaged metrics of the generalized approach (0.36, 0.36, 0.31, 0.37) underline the difficulties in defining a univocal representative model for subjects with different physiological characteristics.A hybrid approach combining personalized and universal models, as investigated in this study, may be beneficial for blood pressure monitoring.A universal model could be used as a baseline to provide initial predictions for all users, and personalized models could be applied to increase the model's performance where personalization is deemed critical, such as users with complex health conditions or unique risk factors accommodating the inherent diversity in BP patterns among different subjects.
In [62], the authors used a transfer learning technique that personalized specific layers of a pre-trained network to improve the performance of PPG-based BP estimation, highlighting the influence of the number of data samples and source subjects used for training.Our analysis of the results shows that, on average, all the PSMs improved the performance of the generalized model regardless of the number of source subjects employed for training.Moreover, by observing the metrics displayed in Table 6, strategies including data obtained from different individuals demonstrated better performance in comparison to the model constructed exclusively using data from the tested subject (PSM SD ) where, as reported in Figure 6, the classification performance of eight subjects witnessed a substantial decline.Subject #2 emerged as the most adversely affected, exhibiting a notable drop of all metrics up to 0.63, 0.43, 0.67, and 0.52 for accuracy, precision, recall, and F1-score, respectively.These fluctuations in classification performance are a direct consequence of the phenomenon of overfitting whereby the model cannot correctly predict data that differ from the small training set available.To mitigate this issue, we included data from 5, 10, or 15 randomly selected subjects from the dataset in addition to diverse fractions of the target subject's data (15%, 30%, 50%).In this way, we were able to evaluate the behavior of the model according to different sizes of the training set, degrees of personalization, and combinations of hyperparameters.Table 6 revealed a distinct inverse correlation between the classification metrics and the increase in the number of individuals.This diminishing pattern suggests the potential implications linked to the higher variability introduced by additional source subjects with respect to the initial quantity of data used to pre-train the model.Hence, this phenomenon may reduce random forest customization and consecutively lead to poorer classification performance for the target subject under evaluation.In fact, as depicted in Figure 7, PSM 5,15% , PSM 10,15% , and PSM 15,15% showed a more pronounced decrease in accuracy value as the number of subjects increased compared to the cases with 30% and 50% of the target subject's data.
This phenomenon is further visible in Figures 8 and 9. Notably, when utilizing only 30% of the data for the pre-training stage, this adverse trend was further accentuated by a more pronounced variability (Figure 8b,c) compared to the scenario with 50% of the data (Figure 9b,c), where the standard deviation was progressively reduced.In the definition of the best solution within the context of our application, we opted to discard any tested combinations exhibiting aggregated accuracy and F1 values below 0.95.This approach ensured that combinations displaying overfitting across multiple subjects were not considered.As result, our selected PSMs were confined to PSM 5,30% , PSM 5,50% , and PSM 10,50% .Upon analyzing the performance of various combinations across subjects within the dataset, it is evident that their overall performance values were generally comparable.However, an exception arose with subject #21, Figures 8a and 9a,b, which exhibited a drop in performance exceeding 10% compared to the training phase in all three combinations, although slightly less evident in PSM 10,50% .This trend was attributed to subject #21 having the lowest number of pulses in the dataset, resulting in a diminished dataset for personalization compared to other subjects.Notably, the combination PSM 15,50% demonstrated a substantial improvement, utilizing more data for personalization along with an increased number of individuals.Hence, due to the similarity observed among the performance metrics, the selection of the best combination was guided by the consideration of the data required for model customization, leading us to favor PSM 5,30% .Employing 30% of the total available data, equivalent to approximately 162 s for the personalization phase, represents a noteworthy outcome.This achievement is particularly significant as it reflects a substantial reduction in the time required for this task compared to the approach outlined in [62], where 250 s of data recording per subject was used for the pre-training stage.Therefore, combining a subset of source subjects, in conjunction with an adequate fraction of data for pre-training leads to increased robustness and generalizability of personalized models across a broader spectrum of cases in BP assessment when compared to standard generalized models.Despite the mentioned improvements, some limitations of the proposed study need to be discussed.In this study, the performance of the proposed approach was evaluated on a limited sample of 28 subjects, falling short of the 85 subjects required by the AAMI [29].To enhance model validation and generalization for accurate BP monitoring, it is crucial to include a diverse range of values that truly represent the population, including both males and females across different age ranges.In our future endeavors, we intend to extend the validation process to encompass a larger and more diverse cohort of individuals, aligning with the standards set by AAMI.Typically, to assess blood pressure variations, multiple sets of data collection over several days are conducted to ensure the algorithm's consistent performance over time for the same individual.However, it is crucial to note that our data collection protocol was designed to induce short-term variations in BP linked to diverse stimuli rather than long-term monitoring.Moreover, increased proficiency in the cognitive tests section would likely result in reduced BP variation due to heightened familiarity with the tasks.

Conclusions
Cuffless blood pressure measurement has gained attention due to clinical demand and recent technological advances in the fields of data acquisition systems, embedded systems, and machine learning techniques.This paper presented a personalized multiclass classification model aimed at the detection of blood pressure variations associated with physical or cognitive workload.Several training strategies were implemented, each differing in the percentages of the dataset and utilizing a diverse subset of individuals as the training set.Experimental results demonstrated that the inclusion of a pre-training stage with data from diverse subjects enabled the discernment of morphological distinctions in beat-to-beat PPG waveforms under various stressors with respect to a generalized model fitted on the whole dataset.Understanding the regulatory mechanisms influencing blood pressure, combined with a reduction in the number of sensors employed to track this latter, constitutes a further step toward unobtrusive cuffless BP monitoring, resulting in better management of this parameter.

Figure 1 .
Figure 1.System employed to collect PPG raw data from the selected sites, elbow (brachial artery) and thumb (digital artery).

Figure 2 .
Figure 2. Data collection protocol followed in this study along with the evolution of the averaged pulse waveforms morphology according to each section of the data capture.

Figure 3 .
Figure 3. (a) Feature extracted from a PPG waveform.(b) Maximum of the first derivative (ms) detected on the velocity plethysmography (VPG).(c) Fiducial points detected on the acceleration plethysmography (APG).

Table 2 .
Criteria for identifying fiducial points on PPG pulse waves.SysMaximum of the pulse waveform Dic First local minimum after the systolic peak or coincident with e Dia First local maximum after dic and before 0.8 T (where T is the duration of the cardiac cycle) VPG, s ′ ms Maximum of the first derivative, s ′ APG, s ′′ a The maximum of s ′′ preceding the maximum of the first derivative ms b First local minimum following a c

Figure 4 .
Figure 4. Overview of the tested training strategies.(Left) Workflow employed for the generalized approach (PIM).(Center) Tested combinations for person-specific strategies (PSMs).(Right) Workflow adopted by every PSM i,j .

•
True positives (TP): correctly predicted instances of a class.• False positives (FP): instances incorrectly predicted as a certain class.• False negatives (FN): instances of a class that are incorrectly predicted as another class.• True negatives (TN): all instances that are correctly not classified as the class under evaluation.

Figure 5
Figure 5 clarifies how TP, FP, FN, and TN are identified respectively for each class.

Figure 6 .
Figure 6.Values of evaluation metrics (accuracy, precision, recall, and F1-score) according to the training strategy denoted as PSM SD .A 90% threshold (indicated by the gray dashed line) is used to identify subjects whose performance drops by more than 10% compared to the training phase.

Figure 7 .
Figure 7. Averaged values of accuracy score according to different combinations of number of source subjects and diverse fractions of data employed to personalize the RF model.

Figure 8 .
Figure 8. Evaluation metrics computed for each individual employing a fraction of the target subject data set equal to 30% and a diverse number of source subjects (N).A 90% threshold (indicated by the gray dashed line) is used to identify subjects whose performance drops by more than 10% compared to the training phase.(a) N = 5.(b) N = 10.(c) N = 15.

Figure 9 .
Figure 9. Evaluation metrics computed for each individual employing a fraction the target subject data set equal to 50% and a diverse number of source subjects (N).A 90% threshold (indicated by the gray dashed line) is used to identify subjects whose performance drops by more than 10% compared to the training phase.(a) N = 5.(b) N = 10.(c) N = 15.

Table 1 .
Overview of the characteristics of the study populations.

Table 3 .
Definition of the extracted features from PPG pulse wave and its derivatives.Time delay between systolic and diastolic peaks t dia − t sys SI Stiffness index, h is the subject's height h/(t dia − t sys ) sys Mean value of the systolic phase of the pulse waveform s µ,diaMean value of the diastolic phase of the pulse waveform - 1Area under the curve between the pulse onset (t 0 ) and the dicrotic notch (t dia ) -A 2Area under the curve between the dicrotic notch (t dia ) and the end of the pulse (t end ) -IPA Inflection point area A 2 /A 1VPG, s ′ Time t ms Time to the maximum slope computed on the first derivative of the pulse t ms − t 0 Amplitude A ms Amplitude of the maximum slope s ′ (t ms ) c Amplitude b/a Amplitude ratio of early systolic negative wave over early systolic positive wave s ′′ (t b )/s ′′ (t a ) c/a Amplitude ratio of late systolic re-increasing wave over early systolic positive wave s ′′ (t c )/s ′′ (t a ) d/a Amplitude ratio of late systolic decreasing wave over early systolic positive wave s ′′ (t d )/s ′′ (t a ) e/a Amplitude ratio of early diastolic positive wave over early systolic positive wave s ′′ (t e )/s ′′ (t a )

Table 4 .
Data processing results.
Abbreviations: REST, measurements at rest; CT, cognitive task section; AE, measurements after physical tasks; PWQA, pulse wave quality assessment

Table 5 .
Averaged reference blood pressure values.

Table 6 .
Macro-averaged evaluation metrics computed for each model.
Abbreviations: MLA, machine learning algorithm; PIM, person independent model; PSM SD , person-specific model with 50% of data from kth subject for training set, 25% as validation set, and 25% as test set; PSM i,j , personspecific model where i identifies the number of source individuals, i ∈ 5, 10, 15, and j refers to the percentage of data belonging to the kth target subject j ∈ 15%, 30%, 50%.Note: * macro-averaged values computed on the 28 subjects employed for the analysis.

Table A2 .
Test set results using person-specific model PSM SD .