Sleep Monitoring during Acute Stroke Rehabilitation: Toward Automated Measurement Using Multimodal Wireless Sensors

Sleep plays a critical role in stroke recovery. However, there are limited practices to measure sleep for individuals with stroke, thus inhibiting our ability to identify and treat poor sleep quality. Wireless, body-worn sensors offer a solution for continuous sleep monitoring. In this study, we explored the feasibility of (1) collecting overnight biophysical data from patients with subacute stroke using a simple sensor system and (2) constructing machine-learned algorithms to detect sleep stages. Ten individuals with stroke in an inpatient rehabilitation hospital wore two wireless sensors during a single night of sleep. Polysomnography served as ground truth to classify different sleep stages. A population model, trained on data from multiple patients and tested on data from a separate patient, performed poorly for this limited sample. Personal models trained on data from one patient and tested on separate data from the same patient demonstrated markedly improved performance over population models and research-grade wearable devices to detect sleep/wake. Ultimately, the heterogeneity of biophysical signals after stroke may present a challenge in building generalizable population models. Personal models offer a provisional method to capture high-resolution sleep metrics from simple wearable sensors by leveraging a single night of polysomnography data.


Introduction
It is increasingly clear that there is an important connection between stroke and sleep. A recent observational study of over 80,000 people found that individuals with insomnia had a 54% increased risk of stroke in the ensuing 4 years compared to age-matched noninsomniacs [1]. After stroke, an estimated 27% of patients report hypersomnia and excessive daytime sleepiness [2], while an estimated 57% of patients report insomnia [3]. Poor sleep has a detrimental impact on recovery, with delayed neuroplasticity and motor learning [4]. Increases in the degree of insomnia are negatively correlated with Barthel index, which measures positive changes in physical function during daily activities [5]. Alternatively, early animal studies have also shown that an increase in slow wave sleep after a stroke (i.e., using optogenetic stimulation) induces positive sleep-dependent plasticity resulting in better functional recovery [6]. Together, these studies emphasize that poor sleep can impede recovery after stroke, and that improving sleep could be one strategy to lead to better patient outcomes.
Before we can provide meaningful interventions to improve sleep and neural recovery for patients with stroke, it is essential to accurately measure sleep and identify deficits.
Currently, there are limited practices in place to evaluate sleep quality in a hospital setting. The most common approach is ordering a sleep log, in which the care team manually checks on the patients periodically throughout the night and notes whether they are awake or asleep. This practice has low sensitivity to measure sleep quantity and provides no indicator of quality, especially in an acute stroke rehabilitation setting in which a certain sleep stage (e.g., slow wave sleep) is important for recovery, and sleep time is fixed by care schedule. At the other extreme, polysomnography (PSG), which consists of electroencephalogram (EEG), photoplethysmography (PPG), electrocardiogram (ECG), capnography, and/or respiration measurements, is the current gold standard technique to evaluate sleep architecture in clinical or research settings [7]. However, PSG equipment for a widespread, long-term implementation is impractical in an inpatient rehabilitation facility (IRF) due to the high system cost and high workload to collect and analyze. Furthermore, patients often report discomfort wearing numerous wired devices, which are bulky and can disrupt their sleep.
Wireless, wearable sensors could address these limitations. As engineering advances, these devices are becoming more flexible, lightweight, and cost effective. These sensors can also collect heart rate, body temperature, or oxygen saturation, which can serve as biomarkers for quantifying the different sleep stages through the modulation of the autonomic nervous system (ANS). ANS activities are often coupled with neuronal activities at the central nervous system (CNS) [8][9][10][11]. This CNS-ANS coupling network directly affects cardiovascular responses during sleep. Today, many commercial activity trackers and smartwatches provide sleep quality measures and sleep staging, though the accuracy of these devices for individuals with stroke is often unclear. Sensors that provide access to raw biophysical signals, when paired with advanced machine learning algorithms, offer the ability to construct sleep detection algorithms from scratch and optimize their performance for specific use cases. We have previously demonstrated that an algorithm trained on data from a set of low-profile sensors measuring motion, heart activity, and skin temperature was promising for detecting different sleep stages (i.e., light sleep, deep sleep, or rapid eye movement (REM) sleep) in healthy individuals [12]. To our knowledge, no algorithms have been constructed or validated for patients with stroke, whose ANS signals can be very different from those of healthy individuals due to their neurological injury [13] or medications used during treatment (e.g., beta blockers to manage heart rhythms).
In this study, we explored the feasibility of evaluating overnight sleep for patients with subacute stroke using multimodal wearable sensors. Similar to our previous work with healthy individuals [12], we obtained overnight sleep data from a preliminary sample of patients using PSG and wearable sensors. We tested two types of supervised machine learning approaches: population models (trained on data from a subset of patients and tested on data from a left-out patient) and personal models (trained on a subset of data from one patient and tested on left-out data from the same patient). Insights from this preliminary work can inform future studies of sleep monitoring for patients with stroke. High-resolution, objective sleep monitoring with wearable sensors would enable us to unobtrusively identify individuals at risk for poor, non-restorative sleep across care settings. In turn, this will empower clinicians and researchers to develop personalized interventions for improving sleep (and thus enhancing neural recovery during acute rehabilitation) for these individuals.

Participants
Ten individuals with stroke (5F/5M; age 58.1 ± 12.1 years) were recruited from the inpatient unit of the Shirley Ryan AbilityLab, a rehabilitation facility in Chicago, IL (USA), for a single night of sleep monitoring with PSG and wearable sensors. The Institutional Review Board Office at Northwestern University (STU00206700) approved the protocol, and all patients provided informed, written consent prior to participation. Participants were required to be at least 18 years of age, understand spoken English at a sixth-grade level or higher, and have a primary diagnosis of stroke. Patients with diagnosed sleep disorders (e.g., obstructive or central sleep apnea) were excluded from the study at this preliminary stage. Table 1 summarizes the demographics and clinical characteristics of the 10 participants. Table A1 (Appendix A) provides additional attributes about stroke for each participant.
Electroencephalographic (EEG), electrooculographic (EOG), and submental electromyographic (EMG) signals, three-lead ECG, and respiratory belt signals were recorded on a portable PSG system (Brain Vision; Morrisville, NC, USA). Ten EEG channels were applied following the international 10-20 system (ROC, LOC, C3, C4, F3, F4, P3, P4, O1, and O2).

Wearable Sensors
ANNE™ One (Sibel Health; Niles, IL, USA) is an FDA-cleared, clinical-grade sensor system with two soft, flexible devices: one adhered to the chest using an adhesive sticker to measure triaxial acceleration as well as ECG, heart rate, respiratory rate, and proximal skin temperature, and one placed on the finger to measure PPG for SpO2, peripheral arterial tonometry, and distal skin temperature [14]. The two devices are time-synchronized and connect to a tablet via Bluetooth. Acceleration was recorded at 52 Hz for the x- and y-axes and 416 Hz for the z-axis (anteroposterior plane). ECG was recorded at 512 Hz, heart rate and respiratory rate at 1 Hz, PPG at 256 Hz, and skin temperature at 1 Hz. The system has been previously validated for sleep-related breathing disorders [15].
In addition, ActiWatch Spectrum (Philips, Cambridge, MA, USA) was used as a research-grade wearable reference sensor. The ActiWatch was placed on the less affected side of each patient. ActiWatch data were analyzed with the Autoscore sleep/wake algorithm from its software ActiWear (Philips, Cambridge, MA, USA).

Procedures
Initial screening was conducted via electronic medical records to identify patients who fulfilled the eligibility criteria, and medical clearance for participation was obtained from their primary attending physician. Following consent, participants were asked to wear the PSG system and ANNE™ sensors for a single night of their inpatient stay (Figure 1). Overnight recordings began at the participant's normal bedtime in the hospital and ended either the next morning, when the participant awoke, or after 8 h of recording.

Data Analysis
All preprocessing was performed in MATLAB R2017b. Data visualization, segmentation, feature extraction, and machine learning were performed in Python 3.9.7 using the numpy, scikit-learn, imblearn, and pandas libraries [16]. Statistical analyses were performed in R 4.1.0 with the caret package [17].
Signals from the PSG system and ANNE sensors were time-synchronized via cross-correlation of their respective ECG signals (xcorr function in MATLAB). ANNE sensor data were cleaned by resampling to the expected sampling rate to ensure consistency, then filtered, and features were extracted for algorithm training and testing. A fifth-order high-pass Butterworth filter with a cutoff frequency of 1 Hz was applied to accelerometer and ECG data. PPG data were processed using the ANNE system's proprietary software (Sibel Health, Inc., Niles, IL, USA) to obtain respiratory rate and oxygen saturation index (SpO2) at a frequency of 5 Hz. The Pan-Tompkins algorithm [18] was applied to the ECG time series signal to detect R peaks, which were used to compute R-R intervals and subsequent features related to heart rate and heart rate variability.
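The synchronization step can be illustrated with a minimal sketch. The study used the xcorr function in MATLAB; the Python below is only a stand-in for the same idea, and the function name `estimate_lag`, the toy spike trains, and the `max_lag` search window are assumptions made for illustration rather than details from the study.

```python
def estimate_lag(reference, delayed, max_lag):
    """Return the shift (in samples) that maximizes the cross-correlation.

    A positive result means `delayed` lags behind `reference`.
    """
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for i, ref_value in enumerate(reference):
            j = i + lag
            if 0 <= j < len(delayed):
                score += ref_value * delayed[j]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


# Toy ECG-like trace: unit spikes at irregular "R peak" positions (hypothetical data).
reference = [0.0] * 60
for beat in (5, 17, 26, 40, 51):
    reference[beat] = 1.0

# Second recording of the same signal, started 4 samples late.
delayed = [0.0] * 4 + reference[:-4]

lag = estimate_lag(reference, delayed, max_lag=10)

# For a positive lag, dropping the leading samples aligns the two recordings.
aligned = delayed[lag:]
```

In practice the lag in samples would be converted to seconds using the common sampling rate after resampling; that conversion is omitted here.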

PSG for Ground Truth Sleep Staging
PSG data were preprocessed using Brain Vision software (Morrisville, NC, USA). A Registered Polysomnographic Technologist visually scored each 30-s epoch of the PSG data as Wake, N1, N2, N3, or REM, in accordance with the American Academy of Sleep Medicine criteria [19]. These scores served as the ground truth for training and testing a machine learning algorithm to classify sleep stage based on the wearable sensor signals. To explore different resolutions of sleep staging, we relabeled the PSG scores at 2-, 3-, and 4-stage resolutions (Table 2).
Time domain features included the mean, standard deviation (STD), interquartile range (IQR), kurtosis, root mean square (RMS), variance, maximum, minimum, range, and inter-axis correlations. The following features related to heart rate were calculated: (1) number of successive R-R intervals that differ by more than 20 ms and 50 ms (i.e., NN20, NN50), and (2) percentage of successive R-R intervals that differ by more than 20 ms and 50 ms (i.e., PNN20, PNN50). A fast Fourier transform was used to estimate the power spectral density of the processed R-R intervals. Frequency domain features included the zero crossing rate and power in the very low frequency (VLF), low frequency (LF), and high frequency (HF) bands. VLF was defined as the band from 0.0033 to 0.04 Hz, LF as 0.04 to 0.15 Hz, and HF as 0.15 to 0.4 Hz.
Time spent in apnea (TSA) was calculated as the amount of time that SpO2 fell below certain thresholds, including 95%, 90%, 85%, 80%, and 70%. Oxygen desaturation index (ODI) was calculated as the number of times within each 30 s epoch that SpO2 decreased by a certain threshold from the previous epoch, including 2, 3, 4, and 5%. Features relating to the distal-to-proximal gradient (DPG) of skin temperature, which is the difference between the limb and chest temperature [20], were also calculated.
Data were excluded if the minimum SpO2 was less than 50% or the minimum heart rate was less than 40, since these values are physiologically unrealistic for this patient cohort, and the features likely stemmed from noisy sensor signals.
Table abbreviations: rho = correlation coefficient; p = correlation p-value; IQR = interquartile range; SD = standard deviation; RMS = root mean square; HR = heart rate; NNx (or PNNx) = sum (or percentage) of R-R intervals larger than x ms (or %); LF = low frequency; VLF = very low frequency; HF = high frequency; DPG = distal-to-proximal gradient; TSAx = time spent in apnea, with SpO2 below x%; SpO2 DI = mean absolute difference between successive mean values of SpO2 over 10-s intervals; ODIx = oxygen desaturation index for SpO2 dropping x% from the previous epoch; ZC = zero crossing rate.
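As a minimal sketch of two of the feature families described above, the following Python computes NNx/PNNx from R-peak times and the time spent below an SpO2 threshold. The function names, the toy R-peak times, and the toy SpO2 trace are hypothetical; the 5 Hz SpO2 rate is taken from the preprocessing description, and the study's exact implementation may differ.

```python
def rr_intervals_ms(r_peak_times_ms):
    """R-R intervals (ms) from consecutive R-peak timestamps (ms)."""
    return [b - a for a, b in zip(r_peak_times_ms, r_peak_times_ms[1:])]


def nn_features(rr_ms, threshold_ms):
    """Count and percentage of successive R-R differences above a threshold."""
    diffs = [abs(b - a) for a, b in zip(rr_ms, rr_ms[1:])]
    count = sum(d > threshold_ms for d in diffs)
    percent = 100.0 * count / len(diffs) if diffs else 0.0
    return count, percent


def time_below_threshold_s(spo2, threshold, sample_period_s):
    """Seconds during which SpO2 is strictly below the threshold."""
    return sum(value < threshold for value in spo2) * sample_period_s


# Hypothetical R-peak times within one epoch, in milliseconds.
r_peaks = [0, 800, 1620, 2400, 3260, 4050]
rr = rr_intervals_ms(r_peaks)          # [800, 820, 780, 860, 790]
nn20, pnn20 = nn_features(rr, 20)
nn50, pnn50 = nn_features(rr, 50)

# Hypothetical SpO2 samples at 5 Hz (0.2 s per sample).
spo2 = [97, 96, 94, 93, 95, 96, 89, 91, 96, 97]
tsa95 = time_below_threshold_s(spo2, 95, 0.2)
tsa90 = time_below_threshold_s(spo2, 90, 0.2)
```

The same pattern extends to the other thresholds listed in the text (for example 85%, 80%, and 70% for TSA).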

t-Distributed Stochastic Neighbor Embedding (tSNE) Analysis
We applied tSNE, a non-linear dimensionality reduction technique, to visualize our high-dimensional feature set in two-dimensional space. tSNE graphs illustrate the similarity of data points across multiple features using probability-based clustering [21]. Following parameter exploration, we set the perplexity to 40 and the number of iterations to 300 for tSNE clustering. All other parameters were set to the default values from Python scikit-learn (version 1.0.2).

Class Imbalance
Class imbalance is an issue for most machine learning-based sleep monitoring systems. Based on the 4-stage labels, we collected 3484 epochs of light sleep (N1 and N2), 1237 epochs of REM sleep, 636 epochs of deep sleep (N3), and 469 epochs of wake. Since feeding a large amount of imbalanced training data into the model could result in overfitting toward the majority class, we addressed this with the following methods: (1) for random forest and bagging, we used class weights and random resampling; (2) for Gradient Boosting and XGBoost, we used the Synthetic Minority Oversampling Technique (SMOTE) [22].
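To give a sense of the scale of the imbalance, the sketch below derives class weights from the epoch counts reported above. The weighting rule, total / (number of classes × class count), is the common "balanced" heuristic (the one scikit-learn applies for `class_weight="balanced"`); whether the study used exactly this rule is an assumption, and the function name is hypothetical.

```python
# Epoch counts reported in the text for the 4-stage labels.
epoch_counts = {"light": 3484, "rem": 1237, "deep": 636, "wake": 469}


def balanced_class_weights(counts):
    """Weight each class inversely to its frequency (assumed 'balanced' heuristic)."""
    total = sum(counts.values())
    n_classes = len(counts)
    return {label: total / (n_classes * n) for label, n in counts.items()}


weights = balanced_class_weights(epoch_counts)

# Ratio of the largest class (light sleep) to the smallest class (wake).
imbalance_ratio = max(epoch_counts.values()) / min(epoch_counts.values())

for label, weight in sorted(weights.items(), key=lambda item: item[1]):
    print(f"{label:>5}: {epoch_counts[label]:4d} epochs, weight {weight:.2f}")
print(f"largest-to-smallest class ratio: {imbalance_ratio:.1f}")
```

Under this heuristic, each wake epoch counts several times more than each light-sleep epoch during training, which offsets the roughly sevenfold difference in class size.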

Model Development and Training
We constructed two types of supervised machine learning models, including population models (trained on data from a subset of subjects and tested on data from a held-out subject) and personal models (trained on a subset of data from one subject and tested on held-out data from that same subject). Four algorithms (Balanced Bagging, Balanced Random Forest, Gradient Boosting, and XGBoost) were compared in both the population and personal model frameworks for their ability to classify sleep stages at three different resolutions (2-, 3-, and 4-stage). In our initial exploration, we considered additional models, including linear discriminant classifiers, support vector machines, k-nearest neighbor, and a combination of recurrent and convolutional neural networks [23] for this problem. However, none of those models outperformed the ensemble and boosting algorithms presented here.
Prior to model training, correlated features (Pearson correlation coefficient greater than 0.95) were removed. All remaining features were normalized within subjects using z-score normalization to minimize the potential influence of baseline values and outliers on the model. Models were trained and tested using a nested cross-validation approach with an inner and outer loop. For population models, the outer loop consisted of leave-one-subject-out cross-validation (LOSOCV), wherein each subject was considered a separate fold. For personal models, the outer loop consisted of 10-fold cross-validation, wherein the data were divided evenly into 10 folds and stratified by class so that each labeled sleep stage was equally represented across folds. For both types of models, cross-validation was implemented by assigning one fold as the test set and the remaining folds as the training set; this process was repeated until each fold was tested, and model performance metrics were averaged across all tested folds. The inner loop consisted of feature selection and hyperparameter tuning with an additional 10-fold cross-validation. Using recursive feature elimination (RFE) and grid search, respectively, features and hyperparameters were selected based on maximization of the weighted F1 score.
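Two of the steps above, within-subject z-score normalization and the leave-one-subject-out outer loop, can be sketched as follows. This is an illustrative outline only: the function names and toy heart-rate values are hypothetical, the population standard deviation is an assumed choice, and the inner loop (RFE and grid search) is omitted.

```python
from statistics import mean, pstdev


def zscore_within_subject(values_by_subject):
    """Normalize one feature separately for each subject (zero mean, unit SD)."""
    normalized = {}
    for subject, values in values_by_subject.items():
        mu, sigma = mean(values), pstdev(values)
        # A constant feature carries no information; map it to zeros.
        normalized[subject] = [(v - mu) / sigma if sigma > 0 else 0.0 for v in values]
    return normalized


def leave_one_subject_out(subjects):
    """Yield (training subjects, held-out subject) pairs, one fold per subject."""
    for held_out in subjects:
        training = [s for s in subjects if s != held_out]
        yield training, held_out


# Hypothetical per-epoch mean heart rate for three subjects.
heart_rate = {"S1": [60, 62, 64], "S2": [80, 85, 90], "S3": [70, 70, 76]}

normalized = zscore_within_subject(heart_rate)
folds = list(leave_one_subject_out(sorted(heart_rate)))

for training, held_out in folds:
    print(f"train on {training}, test on {held_out}")
```

Normalizing within each subject removes baseline offsets (here, S2's higher resting heart rate) before the data are pooled for the population model.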

Model Evaluation
Sleep stage predictions from each model were compared to ground truth PSG labels for each test dataset. Cohen's kappa was used for between-model comparisons and to quantify degree of agreement between the model predictions and ground truth. Cohen's kappa was interpreted with values <0 as worse-than-chance agreement, 0-0.20 as slight, 0.21-0.40 as fair, 0.41-0.60 as moderate, 0.61-0.80 as substantial, and 0.81-1 as near-perfect agreement [24].
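For reference, Cohen's kappa and the interpretation bands above can be written as a short sketch. The function names and the toy label sequences are hypothetical; values that fall between two published bands (for example 0.205) are assigned to the higher band here, which is an assumption.

```python
from collections import Counter


def cohen_kappa(truth, predicted):
    """Agreement between two label sequences, corrected for chance agreement."""
    n = len(truth)
    observed = sum(t == p for t, p in zip(truth, predicted)) / n
    truth_counts, predicted_counts = Counter(truth), Counter(predicted)
    expected = sum(truth_counts[c] * predicted_counts[c] for c in truth_counts) / (n * n)
    # Note: undefined (division by zero) if both sequences contain a single identical class.
    return (observed - expected) / (1 - expected)


def interpret_kappa(kappa):
    """Map a kappa value onto the agreement bands used in the text."""
    if kappa < 0:
        return "worse than chance"
    bands = [(0.20, "slight"), (0.40, "fair"), (0.60, "moderate"), (0.80, "substantial")]
    for upper, label in bands:
        if kappa <= upper:
            return label
    return "near-perfect"


# Toy 2-stage example: W = wake, S = sleep.
truth = ["W", "W", "S", "S", "S", "S", "S", "S", "S", "S"]
predicted = ["W", "S", "S", "S", "S", "S", "S", "S", "S", "W"]

kappa = cohen_kappa(truth, predicted)
label = interpret_kappa(kappa)
```

In this toy example the raw agreement is 80%, but kappa is only 0.375 ("fair") because most of that agreement is expected by chance when sleep dominates the labels.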
A pooled Cohen's kappa was computed using all test data of each model to summarize the overall algorithm performance. We chose to pool rather than average the Cohen's kappa values because averaging would have an additive effect on the numerator variance but a multiplicative effect on the denominator variance. This difference in variance calculation may increase the error of the estimate, especially around the middle values (e.g., Cohen's kappa of 0.4-0.7) [25]. We also computed the following metrics for each class: (1) specificity, (2) precision, (3) sensitivity, (4) F1 score, and (5) balanced accuracy. Balanced accuracy is the arithmetic mean of the recall (sensitivity) of each class. The F1 score is the harmonic mean of precision and sensitivity, ranging from 0 (lowest performance) to 1 (highest performance), and is computed according to Equation (1):

$$F_1 = \frac{2 \cdot \text{precision} \cdot \text{sensitivity}}{\text{precision} + \text{sensitivity}} \tag{1}$$
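The per-class metrics listed above can be illustrated with a small one-vs-rest sketch. The function names and the toy 3-stage labels are hypothetical, and returning 0 when a denominator is empty is an assumed convention.

```python
def safe_ratio(numerator, denominator):
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def per_class_metrics(truth, predicted, positive):
    """One-vs-rest sensitivity, specificity, precision, and F1 for one class."""
    pairs = list(zip(truth, predicted))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    sensitivity = safe_ratio(tp, tp + fn)
    precision = safe_ratio(tp, tp + fp)
    return {
        "sensitivity": sensitivity,
        "specificity": safe_ratio(tn, tn + fp),
        "precision": precision,
        "f1": safe_ratio(2 * precision * sensitivity, precision + sensitivity),
    }


def balanced_accuracy(truth, predicted):
    """Arithmetic mean of the per-class sensitivity (recall)."""
    classes = sorted(set(truth))
    recalls = [per_class_metrics(truth, predicted, c)["sensitivity"] for c in classes]
    return sum(recalls) / len(classes)


# Toy 3-stage example (wake, non-REM, REM).
truth = ["wake"] * 3 + ["nrem"] * 4 + ["rem"] * 3
predicted = ["wake", "wake", "nrem", "nrem", "nrem", "nrem", "rem", "rem", "rem", "nrem"]

wake_metrics = per_class_metrics(truth, predicted, "wake")
nrem_metrics = per_class_metrics(truth, predicted, "nrem")
overall_balanced_accuracy = balanced_accuracy(truth, predicted)
```

Here the wake class has perfect precision but a sensitivity of 2/3, giving an F1 score of 0.8.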

Sensor Data Visualization
After visual inspection, we excluded one subject from analysis due to consistently noisy ECG and PPG signals throughout the night, suggesting the sensors had poor contact with the skin during recording. Therefore, the final dataset for population models included nine subjects, amounting to 48.55 total hours of data for training and testing. After removing noisy data, Subject 6 had only 1.73 h available and was removed from the personal model analysis due to the small size of the dataset for training and testing a personal model for this participant. For personal models, a total of 46.82 h of data were available for use.
The ideal scenario for building a successful population model for sleep stage recognition is to have highly similar signal patterns and features within each sleep stage that generalize across individuals. For the subacute stroke population, we considered that each patient could have substantially different physiological signals (i.e., heart rate variability, oxygen saturation, and core-limb temperature fluctuations) due to the heterogeneous nature of stroke and its impact on regulation of the autonomic nervous system. Figure 2 illustrates this between-subject heterogeneity via tSNE graphs. Sleep stages generally do not explain the different data clusters (Figure 2A), indicating a low probability that data points from the same sleep stage are similar. Rather, many of the tSNE clusters are composed of data from individual subjects (Figure 2B). This suggests that, in the current dataset, similarity within subjects supersedes similarity within sleep stages, which would likely create challenges for a population-trained model to generalize to new patients.

Figure 2 (caption, partial). The clusters illustrate that features are more similar within patients than within sleep classes, suggesting that the similarity within each subject is greater than similarities across different sleep stages. This indicates a machine learning algorithm trained on population data may be challenged to learn characteristic patterns of the different sleep stages that would generalize to new patients.

Machine Learning
Cohen's kappa scores for the different algorithms (Bagging, Random Forest, Gradient Boosting, and XGBoost) and sleep-stage resolutions (2-, 3-, and 4-stage) are given in Table 3 for both the population and personal model frameworks. As expected, population models generally performed poorly, with the highest Cohen's kappa value of 0.27 provided by Gradient Boosting and 2-stage classification. Personalized models performed better by comparison, with the highest Cohen's kappa value of 0.66 provided by XGBoost and 2-stage classification. The 2-stage personal models all outperformed the ActiWatch Autoscore algorithm for these patients, which had a Cohen's kappa value of 0.48 (Table 3). Gradient Boosting and XGBoost performed similarly across all metrics. Since both algorithms are methodologically similar, we selected XGBoost as the representative algorithm to evaluate performance in further analyses. Table A2 (Appendix A) shows detailed statistics of the population model using the XGBoost algorithm.
Table 3 (notes). ActiWatch Autoscore, 2-stage: Cohen's kappa = 0.477. Italic values indicate the best-performing algorithm within each sleep stage resolution (2-stage, 3-stage, 4-stage) for both population and personalized models. Asterisk (*) indicates the algorithm selected for further analysis, based on its best or near-best performance across model designs.
Model performance metrics (sensitivity, precision, F1 score, and balanced accuracy) for the best-performing model (XGBoost, personal) are given in Table 4 for each of the three levels of sleep-stage resolution. Macro average F1 scores were 0.83, 0.76, and 0.66 for the 2-, 3-, and 4-stage models, respectively. Figure 3 shows the confusion matrix of the 4-stage personal model. Average recall of wake, light sleep, deep sleep, and REM sleep was 75.3%, 65.0%, 83.3%, and 70.1%, respectively. Wake, deep sleep, and REM sleep were all most often misclassified as light sleep, while misclassifications of light sleep were similarly distributed across the other stages. Figure 4 shows a representative hypnogram from the best-performing personal model compared to ground-truth sleep stages from PSG.
For this patient, the model slightly overestimated the time spent awake (4.9% of the total recording time, versus the actual 3.6% determined from PSG), underestimated light sleep (40.0%, versus actual 42.6%), overestimated deep sleep (32.0%, versus actual 27.8%), and underestimated REM sleep (23.2%, versus actual 26.0%). The model-estimated total sleep time was 357 min, versus 362 min from PSG. The model-estimated number of awakenings was six, versus one from PSG.
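Summary metrics like those reported above can be derived from a sequence of 30-s epoch labels. The following is a minimal sketch with a hypothetical function name and a toy hypnogram; in particular, counting an awakening as any transition from a sleep stage into wake is an assumed definition, since the text does not state the rule used.

```python
EPOCH_SECONDS = 30  # epoch length used for PSG scoring in the text


def summarize_hypnogram(stages, epoch_seconds=EPOCH_SECONDS):
    """Stage percentages, total sleep time (min), and number of awakenings."""
    n_epochs = len(stages)
    percent = {s: 100.0 * stages.count(s) / n_epochs for s in sorted(set(stages))}
    sleep_epochs = sum(s != "wake" for s in stages)
    # Assumed definition: an awakening is a transition from any sleep stage to wake.
    awakenings = sum(
        previous != "wake" and current == "wake"
        for previous, current in zip(stages, stages[1:])
    )
    return {
        "percent": percent,
        "total_sleep_min": sleep_epochs * epoch_seconds / 60,
        "awakenings": awakenings,
    }


# Toy 10-minute hypnogram (20 epochs), not patient data.
stages = ["wake"] * 2 + ["light"] * 6 + ["deep"] * 4 + ["wake"] + ["light"] * 3 + ["rem"] * 4

summary = summarize_hypnogram(stages)
```

Because awakenings are counted from transitions, a few isolated misclassified epochs can inflate this metric even when total sleep time and stage percentages remain close to the PSG values.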

Discussion
In this preliminary study, we explored two types of models (population vs. personal), four algorithms (Balanced Bagging, Balanced Random Forest, Gradient Boosting, and XGBoost), and three resolutions of sleep staging (2-, 3-, and 4-stage). To our knowledge, this is the first machine learning-based study to use multimodal physiological data (motion, ECG, PPG, skin temperature) from commercialized wearable sensors to classify sleep stages in a subacute stroke population. This study using low-profile, multimodal wearable sensors provides a critical first step for improving the accuracy, resolution, and feasibility of longitudinal sleep monitoring after stroke. Measuring detailed changes in sleep architecture over time can help us better understand the complex relationship between sleep and stroke recovery. Detecting poor sleep quality and intervening for affected patients may improve their rehabilitation outcomes or prevent future recurrence of stroke.
The wearable sensors used in this study offer a less obtrusive option for long-term sleep stage monitoring than PSG for patients with stroke across the inpatient or outpatient care settings. While current research-grade sleep monitors such as ActiWatch are also unobtrusive compared to standard PSG, their low monitoring resolution (2-stage detection of sleep and wake) and accuracy (mean balanced accuracy = 0.72) may not be sufficient to capture overall sleep quality for the subacute stroke population. Indeed, our personal machine learning models with multidimensional sensor data demonstrated improved accuracy in 2-stage monitoring over the ActiWatch for patients with stroke (mean balanced accuracy = 0.83), and this approach enhanced the ability to perform 3- and 4-stage detection with similar performance.
Personal models outperformed a population model, likely due to the small sample size and the heterogeneity in physiological signals between patients. For personal models, XGBoost was the best-performing algorithm with an F1 score of 0.76 when identifying wake, non-REM, and REM sleep stages and an F1 score of 0.83 when identifying wake and sleep. A 4-stage personal model successfully recalled 65.0-83.3% of stages on average. Light sleep was the stage most prone to error, both in being misclassified as other stages and in having other stages misclassified as light sleep. This may be due to the greater prevalence of these samples in the dataset (class imbalance), as well as more nuanced physiological changes during the transitions between wake and deeper stages of sleep. Although the model-estimated total sleep duration and relative composition of sleep stages were generally similar to the ground truth obtained from PSG, misclassifications from the model can skew other metrics of sleep quality based on the estimated sleep architecture (e.g., number of awakenings). This calls for additional study to improve the sleep-stage classifier.
For real-world implementation, personal models would require clinicians to record at least one night of PSG and wearable sensor data for each new patient. Sleep stages could be obtained from wearable sensors alone on the subsequent nights. Although most IRFs rarely conduct PSG-based sleep studies for the patients, unless ordered by the physician, the continued accumulation of evidence about the relationship between sleep and stroke may encourage PSG recordings to become a more common practice in the future. If so, PSG may be conducted early in the IRF program for patients with stroke, and less intrusive wearable devices could be used for the subsequent nights to facilitate continuous, long-term sleep monitoring. Personal models may be a reserve option for accurate, individualized monitoring due to the heterogeneity of the stroke symptoms that likely affect the physiological data of post-stroke individuals.
Ideally, one would want to create a wearable sensor-based sleep stage detection system without conducting a PSG sleep study on each new patient, since PSG equipment is expensive, uncomfortable for the wearer, and requires significant time and resources to collect and score data. Therefore, a population model is more favorable than a personal model for real-world clinical implementation. We found that the population model performed poorly in this study, which can most likely be explained by the small sample size for model training and the heterogeneity of sensor features between patients (Figure 2). The current study implemented an approach similar to our prior research in sleep classification with healthy adults [12]. In that prior study, a population model was more effective in classifying 2- and 3-stage sleep, detecting wake and sleep with a recall of 74.4% and 90.0%, respectively, and detecting non-REM and REM with recall of 73.3%, 59.0%, and 56.0%, respectively. The disparate performance of population models between our previous and current studies further indicates that sleep models for the stroke population may have different needs than those for healthy controls.
Other researchers with larger sample sizes have also demonstrated greater efficacy of population models for healthy individuals. For example, Zhang and colleagues [26] constructed a population multilayer deep learning model from 39 subjects using heart rate and motion data from wearable sensors and obtained an F1 score of 0.6. In a separate study, Zhang and colleagues [27] constructed a population bidirectional long short-term memory model based on heart rate and respiratory rate, trained on 417 subjects from a public PSG database and tested on 32 subjects, and obtained a Cohen's kappa of 0.69. Beattie and colleagues [28] constructed a population linear discriminant model from 60 subjects based on heart rate and motion and obtained a Cohen's kappa of 0.5. Together, this research suggests that transfer learning or a larger sample size in healthy individuals would improve the performance of a 4-stage population model. Additional details about wearable and non-wearable methods for non-invasive sleep monitoring can be found in a recent review [29].
However, the presence of stroke likely complicates this task even further, because the effects of stroke on biophysical outcomes are complex and depend on the lesion site, severity, and treatment. For example, subcortical lesions have a more drastic effect on the central ANS than cortical lesions, which in turn would have a different impact on bodily processes such as heart rate, respiration, and temperature regulation [13]. Prior studies have shown that features of heart rate variability can serve as biomarkers for identifying stroke and predicting the condition and outcome of stroke [30,31]. Additionally, multiple co-morbidities combined with polypharmacy (Tables 1 and A1) would further diversify the physiological measures of individuals undergoing recovery and treatment after stroke.
For data-driven modeling, the predictive power of a model can only extend as far as the existing data boundaries. Classifying data that fall outside of the training data boundary may reduce the accuracy or generalizability of the machine-learned model. By increasing the heterogeneity and sample size of the training data, we expect the model boundaries would be expanded for more accurate classification. Previous large-scale database studies have seen moderate success in population models for automatic sleep stage detection using physiological signals such as ECG and/or PPG isolated from PSG data in targeted patient groups [32]. For example, Sridhar and colleagues leveraged ECG data from the Sleep Heart Health Study (SHHS; N = 6705) and Multi-Ethnic Study of Atherosclerosis (MESA; N = 1619) to build a deep neural network (DNN). They obtained 77% accuracy and a Cohen's kappa coefficient of approximately 0.66 in a 4-stage resolution model (i.e., Wake, Light, Deep, and REM) [33]. Korkalainen and colleagues collected PSG data from 894 OSA patients and built a DNN consisting of a convolutional network and a recurrent network. This model had an accuracy of 80.1% with a Cohen's kappa of 0.65 for predicting three sleep stages [23]. Thus, it is possible that with additional training data from patients with subacute stroke, automated sleep stage classification with wearable sensors will become feasible. Future work will consider incorporating additional data from healthy individuals and applying transfer learning methods to improve performance for patients with subacute stroke. In the absence of a larger database for multimodal wearable sensor data, we have demonstrated that personal models may be a feasible alternative for stroke patients requiring detailed, long-term sleep monitoring.

Limitations
There are several limitations to this study, which should be considered and addressed in future work. First, a small sample size was used to train and test the machine learning models, likely contributing to the poor generalization of the population model. Given the variable effects of stroke and its treatments (e.g., medication) on patient biometrics, the optimal size and composition of training data for a generalized population model remain unknown. Future work may consider developing separate models for subsets of the stroke population, such as by stroke type or level of impairment, or incorporating data from healthy individuals to supplement the model training.
Second, all data were collected during a single night of recording. Multiple nights of collection will be a critical extension of this study to validate personal models (i.e., training an algorithm on one night of sensor data and testing on another). Although the 10-fold cross-validation method for personal models, necessitated by the single night of data, may limit the implications of our results, it is an efficient way to pilot the feasibility of collecting wearable sensor data from a subacute stroke population and detecting sleep stages with a machine learning model.
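The two evaluation schemes discussed here differ only in how epochs are divided into training and test sets. The sketch below, in plain Python, illustrates that difference under stated assumptions and is not the pipeline used in this study. The function names, subject labels, and epoch counts are hypothetical, and shuffling epochs before forming the folds is an assumption, since the fold construction is not described in this section.

```python
import random


def leave_one_subject_out(subject_ids):
    """Population-model splits: train on all patients but one, test on the held-out patient."""
    for held_out in sorted(set(subject_ids)):
        train = [i for i, s in enumerate(subject_ids) if s != held_out]
        test = [i for i, s in enumerate(subject_ids) if s == held_out]
        yield held_out, train, test


def within_subject_kfold(n_epochs, k=10, seed=0):
    """Personal-model splits: k folds drawn from a single patient's night of epochs."""
    indices = list(range(n_epochs))
    random.Random(seed).shuffle(indices)  # ASSUMPTION: folds are shuffled, not contiguous
    folds = [indices[f::k] for f in range(k)]
    for f in range(k):
        test = sorted(folds[f])
        train = sorted(i for g in range(k) if g != f for i in folds[g])
        yield train, test


# HYPOTHETICAL example: nine epochs from three patients.
subject_ids = ["P01"] * 3 + ["P02"] * 2 + ["P03"] * 4
population_splits = list(leave_one_subject_out(subject_ids))
for held_out, train, test in population_splits:
    print(held_out, "train epochs:", len(train), "test epochs:", len(test))

# HYPOTHETICAL example: 25 epochs from one patient, split into 10 folds.
personal_splits = list(within_subject_kfold(25, k=10))
print("personal folds:", len(personal_splits))
```

With shuffled folds, neighboring epochs from the same night can land in both the training and test sets, and such epochs tend to resemble one another. Training on one night and testing on another, as proposed above, would avoid this overlap.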
Third, we excluded patients with clinically diagnosed sleep disorders. This exclusion was intended to minimize the potentially confounding impact of these disorders on the model in this preliminary study, since sleep disorders can further affect physiological signals beyond the already-variable effects of stroke (e.g., altered movement in REM sleep behavior disorder [34], and altered patterns in heart rate variability, respiratory rate, and blood oxygenation in sleep apnea [35,36]). However, sleep disorders are common in the stroke population, with an estimated 50-70% of patients experiencing sleep apnea alone [37]. Thus, it will be essential to include patients with sleep disorders in future model training and validation efforts.

Conclusions
In this study, we established the feasibility of constructing machine-learning models for sleep monitoring in a subacute stroke population using data from two wireless, wearable sensors. We examined both population and personal models for the supervised classification of sleep stages. The heterogeneity of biophysical signals after stroke will pose a challenge in building a population model that generalizes across individuals, likely requiring significantly more training data. Personal models were a feasible alternative for a small sample size, demonstrating fair accuracy in distinguishing wake, non-REM, and REM sleep. Future work will sample additional data within and between patients with subacute stroke for robust model training and refinement.
Conflicts of Interest: J.Y.L. and S.X. hold equity in the company Sonica Health, Inc. that makes wearable sensors for medical applications. The remaining authors declare no competing interests. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.