1. Introduction
Personal health monitoring devices have rapidly gained popularity in the general public, providing the potential to improve feedback and motivation [1,2] towards improvements in physical activity, dietary intake, and sleep. These commercially available products have begun to overcome limitations in cost (as low as $60) and usability, and have been validated several times in the area of physical activity [3,4,5,6]. Yet whether devices are sufficiently accurate in the sleep domain is not well understood.
Polysomnography (PSG) has long been the “gold standard” for measuring sleep. PSG provides general sleep measures, such as total sleep time (TST) and sleep efficiency (SE), in addition to providing measures of specific sleep stages. However, despite clear benefits, PSG is costly, arduous to apply, and can be intrusive to sleep itself, making the search for alternatives essential to the field [7].
Actigraphy has been validated for general measures of sleep (e.g., [8]) and has proven valuable for clinical and research use as it is relatively inexpensive, non-intrusive, and does not require a sleep technician for application. However, a limitation of the validated research-based actigraphy is that it relies on hand-scored data using participant diaries of events (bedtime/wake time and time in bed; [8,9]). Moreover, reliability of sleep staging generated by commercially available devices has received little attention.
Here we sought to determine the validity of sleep measures—TST, SE, light sleep time, and deep sleep time—from four commercially-available personal health monitoring devices compared to PSG. For direct comparison to these devices, we also assessed the validity of TST and SE measures from a research-validated actigraph compared to PSG. Such comparisons are essential for understanding the role of these measures for personal health monitoring and for clinical consideration.
2. Materials
2.1. Actigraphy
The Basis Health Tracker (2014 edition; Intel Corp, Santa Clara, CA, USA) is a wristwatch with an embedded actigraph and automatic sleep detection. Data were uploaded to the user website which generated measures of sleep (see Table 1).
The Fitbit Flex (Fitbit Inc., San Francisco, CA, USA) is a wristband with an embedded actigraph. Sleep-tracking mode is initiated by repeatedly tapping on the band for 1–2 s until two dimming lights appear on the device’s display. The same tapping pattern is used to stop sleep-tracking, at which point the display flashes five LEDs to signal “wake mode.” Data from the Fitbit were uploaded to the user website which generated sleep measures.
The Misfit Shine (Misfit Wearables, San Francisco, CA, USA) was worn with the provided wrist strap. Sleep-tracking mode was set to automatic. Data were uploaded to the device’s mobile application via Bluetooth and measures were extracted from the application.
The Withings Pulse O2 (Withings, Issy-les-Moulineaux, France) was worn on the wrist with the supplied wrist strap. Sleep tracking is manually activated by swiping the finger across the face of the device and deactivated in the same way. Recorded data were uploaded via Bluetooth to the mobile application.
The Actiwatch Spectrum (Philips Respironics, Bend, OR, USA) is a wristwatch with embedded accelerometer and off-wrist detection. The Actiwatch was set to record the mean activity in 15-s epochs. Participants were instructed to press an event-marker button to denote bedtime and wake time.
2.2. Polysomnography
Polysomnography was recorded with an Aura PSG ambulatory system (Natus Neurology, West Warwick, RI, USA). The montage included seven EEG leads (O1, O2, C3, C4, F3, F4, Cz), two EOG leads (one on the side of each eye), two chin EMG leads, two mastoid electrodes, and one ground electrode on the forehead.
5. Discussion
This study was designed to validate and compare sleep measures (TST, SE, Light sleep, and Deep sleep) recorded from several commercially available, wrist-worn personal health monitoring devices against PSG, the “gold standard” for sleep monitoring. Overall, we found specific categories of device data did not differ from PSG measures (summarized in Table 6), yet many devices provided unusable data.
With and without the exclusion of data points exhibiting gross mis-estimation, TST measures from all devices did not differ from PSG measures. On the other hand, only the Actiwatch and Fitbit devices provided SE measures that did not differ from PSG-measured SE (via WSR). Further, only the Actiwatch correlated with PSG for SE, albeit weakly. The Basis, Misfit, and Withings reported Light and Deep sleep measures. However, Light sleep measures were considered distinct from those of PSG based on WSR tests, and only estimates of Light sleep from the Misfit and Withings moderately correlated with PSG measures of light sleep (nREM1 + nREM2). The Basis measure of Deep sleep was not different from that of PSG, and the Withings estimate of Deep sleep was the only one to correlate with PSG (SWS + REM). Thus, several devices were able to accurately assess TST and SE, yet no devices provided reliable staging data.
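The device-versus-PSG agreement described above rests on two simple quantities per measure: the mean of the paired differences (bias) and the correlation between device and PSG values. A minimal sketch of both computations follows; the TST arrays are illustrative values, not the study's data, and the function names are our own:

```python
from math import sqrt

def mean_bias(device, psg):
    """Mean of paired differences (device minus PSG), in minutes."""
    diffs = [d - p for d, p in zip(device, psg)]
    return sum(diffs) / len(diffs)

def pearson_r(x, y):
    """Pearson correlation between paired measurements."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return sxy / (sx * sy)

# Hypothetical TST values in minutes for five nights (not the study's data).
psg_tst    = [420, 390, 450, 405, 435]
device_tst = [430, 385, 470, 400, 450]

print(mean_bias(device_tst, psg_tst))   # positive bias = device overestimates
print(pearson_r(device_tst, psg_tst))
```

In the study itself, agreement in central tendency was tested with Wilcoxon signed-rank (WSR) tests on the paired differences rather than the raw bias alone; the sketch shows only the descriptive quantities underlying those comparisons.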
The high reliability of the Actiwatch Spectrum for estimating TST is consistent with other studies [12]. Based on such evidence, the Actiwatch is widely used in both research (e.g., [13,14]) and clinical (see [15]) settings. However, to achieve this reliability, participants must note bed and wake times in a sleep diary and trained researchers must score the data, which is labor intensive and fails to provide real-time feedback. In the absence of this, the Fitbit and Withings obtained TST measures that did not differ significantly from that of PSG, contrary to others’ findings [16,17,18]. The Fitbit and Basis also had SE measures that did not differ from PSG and were within the same narrow range of bias as the Actiwatch, yet correlations with PSG were not significant for these devices. Therefore, the recommendation of these devices for research purposes may be premature.
Two devices, the Fitbit and the Withings, required user input to initiate “sleep mode,” which may both impede and facilitate data validity. For example, for the Fitbit, measures of SE and TST did not differ significantly from PSG. However, sleep mode initiation may have been an obstacle to measurement, as Fitbit data for 9 participants were lost. These findings are consistent with a recent longitudinal investigation that identified very high data loss for this device [19]. Yet it is important to note that user input per se was not the issue, as the Withings exhibited fewer lost data points than the Fitbit (3 as opposed to 9). We presume this difference occurred because the Withings device clearly confirms the user is in sleep mode (i.e., “Goodnight” flashing on the screen), whereas the Fitbit notifies the user of sleep mode less clearly (i.e., a series of vibrations and lights signals both “sleep mode” and “wake mode”). Therefore, the mode of input for the Fitbit could induce data loss, which minimizes reliability of the device. Devices that did not require user input, the Basis and the Misfit (as we used the automatic mode in this study), performed less satisfactorily. TST measures from both the Basis and Misfit did not differ from PSG; however, the range of bias was large (approximately ±75 min). The Basis was similarly variable for measures of SE relative to PSG.
A unique aspect of this study was the assessment of the validity of measures of Light and Deep sleep, a popular output feature of commercially available devices. Light and Deep sleep measures from all tested devices, the Misfit, Basis, and Withings, were not comparable to PSG. Notably, both Light and Deep sleep measures from the Withings correlated with the respective measures from PSG. However, Light sleep was underestimated and Deep sleep was significantly overestimated. The Misfit measure of Light sleep weakly correlated with that from PSG; however, it was significantly underestimated by over an hour (average bias of 79 min), while Deep sleep was significantly overestimated to an even greater extent (average bias of 107 min). Measures of Light and Deep sleep from the Basis showed little evidence of reliability.
Reliability of Light and Deep sleep measures was based on the assumption that Light sleep refers to nREM1 and nREM2 and Deep sleep refers to SWS and REM. Light sleep is typically defined as the combination of nREM1 and nREM2 in the sleep literature (e.g., [20,21,22]). However, Deep sleep is typically defined as nREM3 and nREM4 (i.e., SWS), with REM a distinct measure. Algorithmically, devices are likely to collapse SWS and REM, as these stages are similarly characterized by low movement. This assumption is supported by the available literature on similar devices (e.g., [23]). Nonetheless, companies making these devices may have a different underlying assumption of what Light and Deep sleep measures capture compared to PSG.
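The stage-collapsing assumption above (Light = nREM1 + nREM2; Deep = SWS + REM) amounts to summing PSG stage durations into two device-style categories. A minimal sketch, with illustrative minute values and a mapping that reflects our assumption rather than any vendor's documented algorithm:

```python
# Minutes scored in each PSG stage for one hypothetical night (illustrative values).
psg_minutes = {"nREM1": 25, "nREM2": 210, "SWS": 80, "REM": 95, "Wake": 30}

# Assumed mapping from PSG stages to the two-category device output.
DEVICE_CATEGORY = {"nREM1": "Light", "nREM2": "Light", "SWS": "Deep", "REM": "Deep"}

def collapse_to_device_stages(stage_minutes):
    """Sum PSG stage durations into 'Light'/'Deep' totals; wake time is dropped."""
    totals = {"Light": 0, "Deep": 0}
    for stage, minutes in stage_minutes.items():
        category = DEVICE_CATEGORY.get(stage)
        if category is not None:
            totals[category] += minutes
    return totals

print(collapse_to_device_stages(psg_minutes))  # → {'Light': 235, 'Deep': 175}
```

If a vendor instead treats REM as Light sleep (or reports it separately), only the mapping dictionary changes, which is why the underlying assumption matters for any PSG comparison.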
In addition to our assumptions regarding Light and Deep sleep, there are other possible limitations to this study. First, there was little variation in the measure of SE across the young adult population studied here, which may explain the lack of correlation between Basis, Misfit, Fitbit, and Withings measures of SE and PSG-measured SE. Second, we excluded data for devices that recorded a TST differing by more than 2 h from the PSG-measured TST (discussed further below). Given that there is no clear way to identify devices with gross failure, we chose a threshold cutoff that could be consistently applied across all devices. Data from the Misfit were excluded most often, which may have resulted in overestimation of the strength of this device. Third, most devices generate global measures, and epoch-by-epoch data were not available, particularly at the appropriate temporal resolution. As such, we could not conduct epoch-by-epoch analyses of sensitivity and specificity (see [24]).
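The exclusion rule described above (drop any record whose device TST deviates from the PSG TST by more than 2 h) reduces to a single threshold filter over paired values. A minimal sketch with hypothetical variable names and data:

```python
THRESHOLD_MIN = 120  # the 2 h cutoff, expressed in minutes

def exclude_gross_failures(pairs, threshold=THRESHOLD_MIN):
    """Keep (device_tst, psg_tst) pairs whose absolute difference is within threshold."""
    return [(d, p) for d, p in pairs if abs(d - p) <= threshold]

# Hypothetical (device, PSG) TST pairs in minutes; the 190-min deviation
# in the second pair would count as a gross failure and be dropped.
records = [(430, 420), (240, 430), (465, 450)]
print(exclude_gross_failures(records))  # → [(430, 420), (465, 450)]
```

Applying one fixed cutoff across all devices, as here, is what keeps the exclusion criterion consistent even though the underlying cause of each failure (user error versus hardware glitch) is unknown.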
We chose to perform comparative analyses despite high data loss. In the adjusted analyses, we also chose to exclude data points deemed as a gross failure based on TST, the measure found to be most reliable across devices. We were interested in comparing devices without the failed devices given that we speculate instances of gross failure are indistinguishable from other device errors. For example, use of the Misfit device resulted in high data loss due to mis-estimation. Although the cause of these errors is unknown, they may be either human-related (e.g., the watch was not tight enough to the wrist) or hardware-related (e.g., a glitch in the recording system), and thus these errors were grouped with more blatant data loss instances (e.g., user-initiation errors). Although the removal of such data points may bias results in favor of the devices, we thought it important to examine data validity when “failure” points are overcome. In that way, the true validity of the devices could be determined. Nevertheless, comparative analyses should be interpreted with the knowledge that errors are a frequent impediment to data acquisition.
Collectively, these data suggest that the value of commercially available devices for measurement of sleep depends on the measure of interest and application. Total sleep time, and in some cases, sleep efficiency, can be monitored by wrist-worn, commercially available devices, yet the reliability of these devices remains low. These devices do not yet yield sufficient information for accurate sleep staging, even on a superficial level (e.g., Light vs. Deep). Therefore, research focusing on habitual total sleep time could utilize some of these devices, while work focusing on sleep efficiency or staging, as well as clinical applications such as detection of apneic events, should continue to rely on PSG. Given the continuing advancement of sleep-detecting algorithms and measurement techniques, it is not unrealistic to believe that more complete and commercially available sleep monitoring systems will be available in the near future.