Simultaneous Bench-Based Metrological Characterization of Smartwatches’ Accelerometers for Accurate Measurement
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
The manuscript addresses a relevant topic for Technologies: the calibration and comparability of smartwatch accelerometers for movement-disorder monitoring. The proposed use of a seismic table for simultaneous calibration of multiple devices is practically valuable, especially for large-scale wearable-sensor studies. The experimental setup is generally well motivated, and the manuscript provides a reasonable methodological framework based on reference acceleration measurements and uncertainty analysis. However, the current version does not yet provide enough evidence to support its main claims. The authors are asked to address the following comments.
1. The results should be expanded to support the key claim of simultaneous multi-device calibration. The manuscript presents frequency-response curves for five smartwatches, but the detailed indication error and uncertainty analysis is shown only for smartwatch 2, which is described as the device with the lowest error. This selection may bias the interpretation. The authors should provide error values and expanded uncertainty ranges for all tested smartwatches.
2. The discrepancy between the planned 15 test conditions and the 12 actually measured conditions must be handled more clearly. The Methods section defines 15 frequency-amplitude combinations, but the Results section later states that three points could not be measured because of seismic-table limitations.
3. The uncertainty analysis requires clarification. The manuscript states that the selected smartwatch remains within the 6% validity range, but also states that the uncertainty interval exceeds the 6% error range. This raises an important interpretation issue: if the expanded uncertainty interval crosses the acceptance threshold, the conclusion that the device satisfies the criterion is not fully supported. The authors should clearly define the acceptance rule and explain how uncertainty is incorporated into the pass/fail judgment.
4. The claim that calibration along each accelerometer axis is similar needs evidence. The study focuses on the z axis, but the manuscript discusses triaxial accelerometers and movement-disorder applications. The authors should either provide results for all three axes or substantially limit the claims to the evaluated axis. At minimum, representative data or summary statistics for the other axes should be included.
5. Several presentation and editorial issues should be corrected before publication. Table 1 reports smartwatch weight in kg, although the values appear to be in grams. The manuscript also contains several language errors, such as “for a example,” “acceleremoter,” and “with all the smartwatch,” which should be corrected through careful proofreading.
Author Response
The manuscript addresses a relevant topic for Technologies: the calibration and comparability of smartwatch accelerometers for movement-disorder monitoring. The proposed use of a seismic table for simultaneous calibration of multiple devices is practically valuable, especially for large-scale wearable-sensor studies. The experimental setup is generally well motivated, and the manuscript provides a reasonable methodological framework based on reference acceleration measurements and uncertainty analysis. However, the current version does not yet provide enough evidence to support its main claims. The authors are asked to address the following comments.
- The results should be expanded to support the key claim of simultaneous multi-device calibration. The manuscript presents frequency-response curves for five smartwatches, but the detailed indication error and uncertainty analysis is shown only for smartwatch 2, which is described as the device with the lowest error. This selection may bias the interpretation. The authors should provide error values and expanded uncertainty ranges for all tested smartwatches.
Thank you for this comment. We agree that presenting the indication error and uncertainty analysis for only one smartwatch was insufficient to support the key claim of simultaneous multi-device calibration and could introduce a selection bias in the interpretation of the results. To address this issue, the indication error and expanded uncertainty analyses for all five evaluated smartwatches, previously provided as supplementary material, have now been incorporated into the main Results section. Specifically, Figs. 6 to 10 present the indication error together with the corresponding expanded uncertainty intervals for all evaluated frequency–amplitude combinations and for each smartwatch individually. In addition, descriptive text discussing the observed behavior and measurement performance of each device has been included. These additions provide a more complete and balanced characterization of the measurement performance across all tested devices, thereby strengthening the evidence supporting the proposed calibration methodology.
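For readers of this exchange, the indication error referred to above can be illustrated with a minimal sketch. It is not taken from the manuscript: it assumes that the error is the relative deviation of the smartwatch RMS acceleration from the reference RMS acceleration, and that the quoted amplitudes are peak values of a sinusoid. The function names, the 100 Hz sampling rate, and the 3% gain error are hypothetical.

```python
import math


def rms(samples):
    """Root-mean-square of a sequence of acceleration samples (m/s^2)."""
    return math.sqrt(sum(x * x for x in samples) / len(samples))


def indication_error_percent(indicated_rms, reference_rms):
    """Relative deviation of the device reading from the reference, in percent."""
    return 100.0 * (indicated_rms - reference_rms) / reference_rms


def sine(amplitude, freq_hz, fs_hz, duration_s):
    """Sampled sinusoid with the given peak amplitude (assumed convention)."""
    n = int(fs_hz * duration_s)
    return [amplitude * math.sin(2 * math.pi * freq_hz * i / fs_hz) for i in range(n)]


# Hypothetical case: reference records 2.5 m/s^2 peak at 4 Hz,
# and a smartwatch with a 3% gain error records the same motion.
reference = sine(2.5, 4.0, 100.0, 2.0)
device = [1.03 * x for x in reference]
error = indication_error_percent(rms(device), rms(reference))
print(f"indication error: {error:.2f} %")
```

In a real processing chain the bandpass filtering mentioned in the manuscript would be applied before the RMS step; it is omitted here to keep the sketch dependency-free.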
- The discrepancy between the planned 15 test conditions and the 12 actually measured conditions must be handled more clearly. The Methods section defines 15 frequency-amplitude combinations, but the Results section later states that three points could not be measured because of seismic-table limitations.
Thank you for this observation. We agree that the discrepancy between the planned and completed test conditions required clearer handling throughout the manuscript. The three conditions that could not be completed (1 Hz at 2.5 m/s², 1 Hz at 4 m/s², and 2 Hz at 4 m/s²) were excluded due to mechanical limitations of the seismic table, which produced inconsistent or non-linear excitation at low frequencies combined with high amplitudes. We recognize that this limitation was not sufficiently explained in the original manuscript, and that its implications for the clinical validity of the methodology, particularly in the context of movement disorder monitoring, deserved further discussion. Moreover, we acknowledge that the practical implementation of this calibration framework for specific motor disorders requires careful selection of an excitation device capable of reliably covering the full amplitude and frequency range characteristic of the target condition. To address all these points, the following changes have been made to the manuscript.
First, we have added the following text to the Methods section (lines 216-220), supplemented by Table 1.
“[…] It should be noted that, due to mechanical limitations of the seismic table at low frequencies combined with high amplitudes, three of the fifteen planned conditions could not be completed: 1 Hz at 2.5 m/s², 1 Hz at 4 m/s², and 2 Hz at 4 m/s². At these conditions, the table produced inconsistent or non-linear excitation. The experimental validation was therefore carried out over 12 amplitude–frequency combinations, as detailed in Table 1. […]”
Also, we have added the following text in the Results section, in lines 401-403:
“[…] As shown, 12 of the 15 planned amplitude–frequency combinations are presented, as the three conditions that could not be completed due to seismic table limitations were described in Section 2.2. […]”
Finally, the Discussion section, in lines 476-487, has been expanded to analyze the implications of these excluded conditions for movement-disorder monitoring applications and to clarify the limitations of the current excitation system. The revised discussion also highlights that, despite these exclusions, the remaining experimental conditions still cover the principal frequency range associated with several clinically relevant movement disorders, thereby supporting the validity of the proposed methodology as a general simultaneous calibration framework.
“[…] The three excluded amplitude–frequency combinations (1 Hz at 2.5 m/s², 1 Hz at 4 m/s², and 2 Hz at 4 m/s²) correspond to low-frequency, high-amplitude conditions that may be clinically relevant in certain movement disorder contexts. While the proposed methodology is intended as a general simultaneous calibration framework, its practical implementation for specific motor disorders requires careful consideration of the characteristic frequency and amplitude ranges associated with each condition. For instance, tremor in Parkinson's disease typically occurs in the 3–7 Hz range, whereas other movement disorders such as essential tremor or dystonia may present distinct kinematic profiles. When deploying this calibration approach for a particular clinical application, it is therefore recommended to select an excitation device capable of reliably covering the full amplitude and frequency range characteristic of the target disorder, avoiding the mechanical limitations observed here at low-frequency, high-amplitude combinations. Future work should explore alternative vibration platforms with extended linear operating ranges to ensure complete validation coverage across all clinically meaningful conditions. […]”
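The bookkeeping behind the 15 planned and 12 completed conditions can be sketched as a filtered grid. This is an illustration only. The three amplitudes and the three excluded points come from the text above, but the five frequency values are an assumption: the record confirms only the 1–8 Hz range and the presence of 1 Hz and 2 Hz, so the tuple below is a hypothetical placeholder.

```python
from itertools import product

# Hypothetical placeholder: only 1 Hz, 2 Hz and the 8 Hz upper bound are
# supported by the text; the intermediate values are assumed for illustration.
FREQUENCIES_HZ = (1, 2, 4, 6, 8)
AMPLITUDES_MS2 = (1.0, 2.5, 4.0)

# Conditions reported as not achievable on the seismic table.
EXCLUDED = {(1, 2.5), (1, 4.0), (2, 4.0)}

planned = list(product(FREQUENCIES_HZ, AMPLITUDES_MS2))
completed = [cond for cond in planned if cond not in EXCLUDED]

print(len(planned), "planned;", len(completed), "completed")
for freq, amp in completed:
    print(f"{freq} Hz at {amp} m/s^2")
```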
- The uncertainty analysis requires clarification. The manuscript states that the selected smartwatch remains within the 6% validity range, but also states that the uncertainty interval exceeds the 6% error range. This raises an important interpretation issue: if the expanded uncertainty interval crosses the acceptance threshold, the conclusion that the device satisfies the criterion is not fully supported. The authors should clearly define the acceptance rule and explain how uncertainty is incorporated into the pass/fail judgment.
Thank you for this important comment. We fully agree that the relationship between the indication error and the expanded uncertainty requires careful clarification. In the original manuscript, the 6% threshold borrowed from ISO 8041-1:2017 was used as an indicative engineering reference in the absence of a dedicated normative framework for wearable accelerometers in clinical applications. We also agree that if the expanded uncertainty interval overlaps or exceeds the reference threshold, a strict conformity or pass/fail interpretation cannot be rigorously supported without a formally defined decision rule and an application-specific normative framework. For this reason, the revised manuscript no longer interprets the 6% threshold as a formal pass/fail conformity assessment criterion. Instead, the indication error and expanded uncertainty are reported together as part of a comparative metrological characterization of the evaluated devices. To clarify this point, the following text has been added to the Results section, in lines 420-424 and in the Discussion Section, in lines 536-550:
“[…] To provide a reference framework for evaluating measurement error, the 6% tolerance threshold defined in ISO 8041-1:2017 for general-purpose vibration meters was adopted. Although this standard is not directly applicable to healthcare contexts, it was used here solely as a reference benchmark in the absence of a specific normative framework for wearable accelerometers in clinical applications. […]”
“[…] A broader limitation of the current evaluation concerns the absence of a dedicated normative framework for the metrological validation of wearable accelerometers in healthcare applications. The 6% tolerance threshold adopted in this study is borrowed from ISO 8041-1:2017, a standard designed for general-purpose vibration meters used in human vibration response assessment, which is not directly applicable to clinical or research wearable contexts. This threshold was therefore used solely as an indicative engineering reference to provide a basis for evaluating measurement error. The fact that expanded uncertainty intervals exceed this boundary at higher excitation amplitudes does not necessarily imply device failure but rather reflects the inherent variability of consumer-grade sensors and the limitations of applying an engineering standard outside its intended scope. The authors wish to emphasize the need to develop specific metrological standards for portable accelerometers in clinical and research settings, as this remains a significant unmet need in this field. Such standards would need to define not only tolerance thresholds appropriate for specific clinical applications, but also reproducibility requirements, wearing condition specifications, and uncertainty reporting guidelines tailored to healthcare contexts. […]”
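The reviewer's point about uncertainty and the threshold can be made concrete with a small sketch. It is not the authors' method: the revised manuscript explicitly avoids a formal pass/fail rule. The sketch assumes one common interval-based convention, in which a result is decided only when the whole interval (error ± expanded uncertainty) lies inside or outside the limit. The function name and the numeric inputs are hypothetical.

```python
def classify(error_pct, expanded_unc_pct, limit_pct=6.0):
    """Compare the interval error +/- U against a symmetric limit (all in percent)."""
    low = error_pct - expanded_unc_pct
    high = error_pct + expanded_unc_pct
    if -limit_pct <= low and high <= limit_pct:
        return "within limit"
    if low > limit_pct or high < -limit_pct:
        return "outside limit"
    # The interval straddles the limit, so no firm statement is supported.
    return "inconclusive"


# Hypothetical values, in percent.
print(classify(2.0, 1.5))   # interval [0.5, 3.5]
print(classify(4.0, 3.0))   # interval [1.0, 7.0]
print(classify(9.0, 2.0))   # interval [7.0, 11.0]
```

Under this convention, a device whose error is below 6% but whose interval crosses 6% falls in the inconclusive band, which is the situation the reviewer describes.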
- The claim that calibration along each accelerometer axis is similar needs evidence. The study focuses on the z axis, but the manuscript discusses triaxial accelerometers and movement-disorder applications. The authors should either provide results for all three axes or substantially limit the claims to the evaluated axis. At minimum, representative data or summary statistics for the other axes should be included.
Thank you for your comment. We agree that the original manuscript did not sufficiently limit its claims to the evaluated axis, and that references to triaxial capabilities required either supporting evidence or more careful scoping. No experimental characterization data for the x- and y-axes were available within the present study. Therefore, rather than extrapolating unsupported conclusions, we revised the manuscript to explicitly limit the experimental validation to the z-axis, which received the direct excitation from the seismic table. In response to this comment, we have taken the following actions.
First, the term 'triaxial' has been removed from the abstract, ensuring that the scope of the experimental validation is accurately represented from the outset. Second, the claims regarding the extension of the methodology to the remaining axes have been reframed throughout the manuscript as a methodological possibility rather than a validated result, making clear that the present study is restricted to the z-axis. The efficiency advantages of simultaneous calibration for triaxial characterization are discussed as a theoretical projection, not as an empirically demonstrated outcome.
To address this, the following text has been added to the Methods section, in lines 281-287:
“[…] The present analysis focuses on the z-axis, which received the primary excitation from the seismic table. Since most consumer-grade smartwatches incorporate triaxial accelerometers, the proposed methodology could be straightforwardly extended to the remaining axes, which would further amplify the time-efficiency advantages of simultaneous calibration, whereas traditional individual calibration would require 135 tests per device per full triaxial characterization, the proposed approach would maintain a fixed total test count regardless of the number of devices evaluated. […]”
The following clarification has been added to the Results section, in lines 387-389:
“[…] The experimental evaluation focuses on the z-axis, which received the direct excitation from the seismic table. […]”
Finally, the following text has been added to the Discussion section, in lines 520-527:
“[…] In this work, the analysis was restricted to the z-axis, which received the direct excitation from the seismic table. Although most consumer-grade smartwatches incorporate triaxial accelerometers, the calibration of additional axes could be implemented using the same experimental framework. In this case, the efficiency gains of simultaneous calibration would be even more pronounced, as the total number of tests remains constant regardless of the number of devices, making the approach particularly attractive for large-scale triaxial characterization. Because of this, future work should explore the effects and applicability of the proposed methodology for different axis. […]”
- Several presentation and editorial issues should be corrected before publication. Table 1 reports smartwatch weight in kg, although the values appear to be in grams. The manuscript also contains several language errors, such as “for a example,” “acceleremoter,” and “with all the smartwatch,” which should be corrected through careful proofreading.
Thank you for pointing out these presentation and editorial issues. The manuscript has been carefully proofread and revised to correct the identified errors. In particular, the unit reported for smartwatch weight in Table 1 has been corrected from kilograms to grams. Additionally, the language inconsistencies and typographical errors identified by the reviewer, including expressions such as “for a example,” “acceleremoter,” and “with all the smartwatch,” have been corrected throughout the manuscript. A general linguistic and editorial revision was also performed to improve clarity and readability.
Author Response File:
Author Response.pdf
Reviewer 2 Report
Comments and Suggestions for Authors
This paper proposes a methodology for simultaneous calibration of accelerometers embedded in multiple consumer-grade smartwatches, using a seismic table and comparison against a reference accelerometer following ISO 16063-21. Five Wear OS devices are mounted on a shared 45° inclined fixture to achieve uniaxial sinusoidal excitation at 12 amplitude–frequency combinations (1–8 Hz, 1–4 m/s²). Data processing applies bandpass filtering and RMS computation; indication errors and expanded uncertainties are evaluated according to ISO-GUM. Results for the best-performing smartwatch remain within the ±6% threshold mandated by ISO 8041-1, though uncertainty intervals widen with amplitude. The authors emphasize the time efficiency of bulk calibration over individual-device procedures, positioning the methodology as a scalable screening tool for large-scale healthcare studies in movement disorders. Several suggestions are supplied:
- Suggest the authors discuss in more depth whether excluding three planned points biases the validation, especially for tremor-relevant high-amplitude scenarios.
- Suggest the authors improve this part: the Dytran 3023M3’s own calibration date, uncertainty, and traceability certificate do not seem to be stated.
- Suggest the authors quantify the noise floor and discuss its influence on the 1 m/s² measurements, where SNR is lowest.
- Suggest the authors discuss the implications for capturing intermittent sensor anomalies relevant in clinical movement monitoring.
- Suggest the authors supply more detail about the potential in the conclusion part.
- Suggest the authors enhance the introduction part with wearable devices such as a wearable photonic smart wristband for cardiorespiratory function assessment and biometric identification, etc.
Author Response
This paper proposes a methodology for simultaneous calibration of accelerometers embedded in multiple consumer-grade smartwatches, using a seismic table and comparison against a reference accelerometer following ISO 16063-21. Five Wear OS devices are mounted on a shared 45° inclined fixture to achieve uniaxial sinusoidal excitation at 12 amplitude–frequency combinations (1–8 Hz, 1–4 m/s²). Data processing applies bandpass filtering and RMS computation; indication errors and expanded uncertainties are evaluated according to ISO-GUM. Results for the best-performing smartwatch remain within the ±6% threshold mandated by ISO 8041-1, though uncertainty intervals widen with amplitude. The authors emphasize the time efficiency of bulk calibration over individual-device procedures, positioning the methodology as a scalable screening tool for large-scale healthcare studies in movement disorders. Several suggestions are supplied:
- Suggest the authors discuss in more depth whether excluding three planned points biases the validation, especially for tremor-relevant high-amplitude scenarios.
Thank you for this observation. We agree that the discrepancy between the planned and completed test conditions required clearer handling throughout the manuscript. The three conditions that could not be completed (1 Hz at 2.5 m/s², 1 Hz at 4 m/s², and 2 Hz at 4 m/s²) were excluded due to mechanical limitations of the seismic table, which produced inconsistent or non-linear excitation at low frequencies combined with high amplitudes. We recognize that this limitation was not sufficiently explained in the original manuscript, and that its implications for the clinical validity of the methodology, particularly in the context of movement disorder monitoring, deserved further discussion. Moreover, we acknowledge that the practical implementation of this calibration framework for specific motor disorders requires careful selection of an excitation device capable of reliably covering the full amplitude and frequency range characteristic of the target condition. To address all these points, the following changes have been made to the manuscript.
First, we have added the following text to the Methods section (lines 216-220), supplemented by Table 1.
“[…] It should be noted that, due to mechanical limitations of the seismic table at low frequencies combined with high amplitudes, three of the fifteen planned conditions could not be completed: 1 Hz at 2.5 m/s², 1 Hz at 4 m/s², and 2 Hz at 4 m/s². At these conditions, the table produced inconsistent or non-linear excitation. The experimental validation was therefore carried out over 12 amplitude–frequency combinations, as detailed in Table 1. […]”
Also, we have added the following text in the Results section, in lines 401-403, and in the Discussion section, in lines 476-487:
“[…] As shown, 12 of the 15 planned amplitude–frequency combinations are presented, as the three conditions that could not be completed due to seismic table limitations were described in Section 2.2. […]”
“[…] The three excluded amplitude–frequency combinations (1 Hz at 2.5 m/s², 1 Hz at 4 m/s², and 2 Hz at 4 m/s²) correspond to low-frequency, high-amplitude conditions that may be clinically relevant in certain movement disorder contexts. While the proposed methodology is intended as a general simultaneous calibration framework, its practical implementation for specific motor disorders requires careful consideration of the characteristic frequency and amplitude ranges associated with each condition. For instance, tremor in Parkinson's disease typically occurs in the 3–7 Hz range, whereas other movement disorders such as essential tremor or dystonia may present distinct kinematic profiles. When deploying this calibration approach for a particular clinical application, it is therefore recommended to select an excitation device capable of reliably covering the full amplitude and frequency range characteristic of the target disorder, avoiding the mechanical limitations observed here at low-frequency, high-amplitude combinations. Future work should explore alternative vibration platforms with extended linear operating ranges to ensure complete validation coverage across all clinically meaningful conditions. […]”
- Suggest the authors improve this part: the Dytran 3023M3’s own calibration date, uncertainty, and traceability certificate do not seem to be stated.
Thank you for this valuable observation. We agree that the metrological characterization of the reference accelerometer is a fundamental aspect of any calibration procedure, and that the original manuscript did not provide sufficient detail regarding the calibration status, traceability, and associated uncertainty of the Dytran 3023M3 reference accelerometer. To address this point, we have expanded Section 2.1 to include the following text in lines 256-267:
"[…] The reference accelerometer and analyzer were calibrated by an accredited calibration laboratory prior to the experimental campaign (March 2025). The corresponding calibration certificate provides the calibration date, traceability to national metrological standards, the associated measurement uncertainty, and the results of calibration tests carried out at different excitation frequencies and acceleration levels. Specifically, calibration was performed at selected frequency points within the sensor's operating range (25, 160, and 2,500 Hz), with controlled reference acceleration amplitudes applied at each frequency to characterize the accelerometer's response. For each calibration point, the certificate records the instrument's indication values, the mean value, the indication error relative to the reference, and the expanded uncertainty. These calibration data were incorporated into the uncertainty balance of the proposed methodology and taken into account throughout the uncertainty analysis of the overall measurement process. […]"
- Suggest the authors quantify the noise floor and discuss its influence on the 1 m/s² measurements, where SNR is lowest.
Thank you for this valuable suggestion. We agree that the influence of the sensor noise floor is particularly relevant for the interpretation of the measurements performed at the lowest excitation amplitude (1 m/s²), where the signal-to-noise ratio (SNR) is expected to be less favorable. In the original manuscript, this aspect was not sufficiently discussed. To address this point, we have expanded the Discussion section to explicitly analyze the likely influence of the noise floor on the increased variability observed at low excitation amplitudes. The following text has been added in lines 488-498:
“[…] It is also worth noting that the higher dispersion observed across smartwatch measurements at 1 m/s² excitation levels is likely influenced by the signal-to-noise ratio (SNR) characteristics of consumer-grade accelerometers. At this amplitude, the excitation signal is closest to the noise floor of the embedded sensors, meaning that background noise contributes more significantly to the recorded values relative to the actual excitation. This effect is consistent with the results presented in Figures 6-10, where the 1 m/s² condition systematically exhibits greater variability and a higher number of out-of-range points compared to the 2.5 and 4 m/s² conditions across all five smartwatch models. A formal characterization of the noise floor for each smartwatch model would provide a more rigorous basis for interpreting low-amplitude measurements and should be considered in future implementations of this calibration protocol. […]”
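The SNR argument above can be illustrated numerically. The sketch below is not from the manuscript, which does not report a noise floor: the 0.05 m/s² RMS noise value is a hypothetical placeholder, and the excitation amplitudes are assumed to be peak values of a sinusoid. It only shows how the SNR shrinks as the excitation amplitude drops toward a fixed noise level.

```python
import math


def snr_db(signal_rms, noise_rms):
    """Signal-to-noise ratio in decibels for two RMS amplitudes."""
    return 20.0 * math.log10(signal_rms / noise_rms)


NOISE_RMS = 0.05  # hypothetical noise floor in m/s^2 RMS, not a measured value
AMPLITUDES_MS2 = (1.0, 2.5, 4.0)  # excitation levels named in the text

# RMS of a sinusoid is its peak amplitude divided by sqrt(2).
snr = {a: snr_db(a / math.sqrt(2), NOISE_RMS) for a in AMPLITUDES_MS2}

for amplitude, value in snr.items():
    print(f"{amplitude} m/s^2 -> SNR {value:.1f} dB")
```

Whatever the true noise floor is, the gap between the 4 m/s² and 1 m/s² conditions is fixed at 20·log10(4), about 12 dB, which is consistent with the greater dispersion reported at the lowest amplitude.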
- Suggest the authors discuss the implications for capturing intermittent sensor anomalies relevant in clinical movement monitoring.
Thank you for raising this important point. We agree that, in clinical movement-monitoring applications, accelerometers may be exposed to intermittent anomalies such as transient motion artefacts, short-duration disturbances, or temporary signal instabilities. Under these conditions, accurate calibration and rigorous uncertainty characterization are particularly important to ensure reliable sensor response, especially when detecting low-amplitude or rapidly varying motion events. To address this point, we have expanded the Discussion section with the following text in lines 563-570:
"[…] In clinical movement-monitoring applications, accelerometers may be exposed to intermittent anomalies such as transient motion artefacts, short-duration disturbances, or temporary signal instabilities. Under these conditions, accurate calibration and rigorous uncertainty characterization are particularly important to ensure reliable sensor response, especially when detecting low-amplitude or rapidly varying motion events. The proposed methodology contributes to improving the metrological reliability of wearable accelerometer systems intended for use in such environments, by providing a systematic framework for quantifying and controlling measurement uncertainty prior to deployment. […]"
- Suggest the authors supply more detail about the potential in the conclusion part.
Thank you for this valuable suggestion. We agree that the conclusion section benefits from a more explicit discussion of the broader implications and potential impact of the proposed calibration framework. In the revised manuscript, the original concluding content has therefore been divided between the Discussion and Conclusions sections to improve clarity and better separate interpretative insights from final take-home messages.
Specifically, we have expanded the Discussion section to include a more detailed consideration of the clinical, technical, and ethical relevance of our work. This includes emphasizing that, beyond the technical contribution, precise characterization of wearable accelerometers may support more accurate assessment of symptom severity and enable more personalized treatment strategies, ultimately contributing to improved patient outcomes and quality of life. We also highlight that the proposed simultaneous calibration approach improves efficiency by reducing testing time, energy consumption, and associated costs, thereby facilitating more scalable deployment in both research and clinical environments.
Furthermore, the Discussion now addresses the ethical dimension of the work, stressing that improving measurement accuracy helps ensure that clinical decisions are based on reliable sensor data rather than potentially misleading signals, supporting more equitable and objective patient care. The growing importance of validated and scalable calibration frameworks is also highlighted in the context of the rapid expansion of wearable technologies for continuous and remote health monitoring, where robust metrological traceability is essential to bridge the gap between consumer-grade devices and clinical-grade requirements.
The Conclusions section has been correspondingly streamlined to provide a more concise synthesis of the main findings and implications of the study.
- Suggest the authors enhance the introduction part with wearable devices such as a wearable photonic smart wristband for cardiorespiratory function assessment and biometric identification, etc.
Thank you for this suggestion. We agree that broadening the scope of wearable technologies presented in the introduction enriches the context of the work. Accordingly, we have added the following text to the introduction, in lines 87-89:
“[…] In [20], a wearable photonic smart wristband is proposed for cardiorespiratory function assessment and biometric identification through continuous monitoring of pulse wave signals. […]”
Author Response File:
Author Response.pdf
Reviewer 3 Report
Comments and Suggestions for Authors
This manuscript proposes a method for simultaneously evaluating accelerometers embedded in multiple consumer-grade smartwatches on the same vibration table. The topic is relevant, particularly for large-scale studies that use multiple wearable devices. However, the manuscript does not adequately distinguish between bench-based physical accuracy testing and clinical validity for movement disorder monitoring. The abstract and conclusions make strong claims about applications to Parkinson’s disease, movement disorders, and healthcare, but the study does not include patients, human wearing conditions, free-living settings, clinical symptoms, or disease severity. The manuscript should therefore be reframed as a methodological study of simultaneous bench-based evaluation of smartwatch accelerometers. The following major issues should be addressed.
- Clarify the study objective and align the conclusions with the data.
The data is limited to controlled vibration-table experiments. However, the abstract and conclusions imply relevance to movement disorder monitoring and healthcare applications. The authors should define the main objective as the development or demonstration of a simultaneous bench-based evaluation method for smartwatch accelerometers. Any clinical implications should be stated cautiously as potential future applications.
- Distinguish “calibration” from “accuracy assessment.”
The manuscript should clarify whether the procedure is intended as comparison calibration, metrological characterization, or post hoc correction. If the term calibration is retained, the authors should specify that no device-specific correction equations were derived or applied, unless such correction procedures are added.
- Provide a clearer rationale for device selection.
The criteria for selecting the five Wear OS smartwatches are insufficiently described. The authors should report the selection criteria, exclusion criteria, operating system version, firmware version, sensor API, actual sampling rate, and missing data rate. Because smartwatch performance depends not only on the accelerometer but also on the operating system, firmware, application, and power management, these details are essential for assessing generalizability.
- Clarify whether only one unit per model was tested.
The manuscript does not clearly state how many units were tested for each smartwatch model. If only one unit per model was tested, the results reflect the performance of specific individual devices, not the performance of each model as a whole. This limitation should be clearly stated. Ideally, the authors should test multiple units of the same model and quantify within-model variability.
- Verify the assumption of identical input conditions during simultaneous testing.
Placing multiple devices on the same vibration table does not necessarily ensure that each device receives the same acceleration input. Local vibration differences, phase differences, rotational components, device position, and mounting conditions may affect the results. The authors should evaluate and report the spatial uniformity of the vibration input and describe the device placement and mounting procedure in detail.
- Justify the selected frequency and acceleration ranges for movement disorder monitoring.
The manuscript states that the tested range of 1–8 Hz and 1–4 m/s² is relevant to Parkinson’s disease and movement disorders. However, the target symptom is not clearly specified. Tremor, gait, dyskinesia, freezing of gait, and upper-limb activity require different frequency ranges, acceleration ranges, and outcome metrics. The authors should explain which clinical application is being targeted and justify the selected test conditions accordingly.
- Address the fact that only 12 of the planned 15 test conditions were completed.
The Methods describe 15 planned combinations of frequency and amplitude, but the Results report that three conditions could not be tested because of limitations of the vibration table. This is an important limitation of the proposed method. The authors should clearly distinguish planned and completed test conditions, provide a table of untested conditions, and discuss how the inability to test low-frequency/high-amplitude conditions affects the applicability of the method.
- Do not generalize from the z-axis alone to the full triaxial accelerometer.
The main analysis focuses on the z-axis. However, smartwatches are worn on the wrist, and the device coordinate system does not remain fixed relative to the body during daily activities. For movement disorder monitoring, all three axes, cross-axis sensitivity, and vector magnitude may be important. The authors should either provide results for all axes or clearly state that the study evaluates only one axis.
- Acknowledge that RMS error alone does not establish validity of clinical metrics.
The study evaluates RMS acceleration error, but movement disorder monitoring may rely on peak frequency, frequency-band power, gait periodicity, activity classification, tremor scores, and other digital biomarkers. A small RMS error does not necessarily mean that these downstream metrics are valid. The authors should clearly describe this limitation and avoid implying validity of clinical digital biomarkers.
- Explain the clinical meaning of the 6% tolerance threshold.
The 6% threshold may be appropriate as an engineering standard, but its clinical relevance is not established. The authors should distinguish between engineering tolerance and clinically meaningful measurement error. They should also discuss how measurement error may affect symptom classification, longitudinal change, intervention effects, and between-group comparisons.
- Interpret uncertainty intervals more cautiously.
The manuscript reports that point estimates for the selected smartwatch were within the 6% range, but the uncertainty intervals exceeded this range. This substantially weakens the conclusion that the device is valid or closely aligned with the reference accelerometer. The conclusions should reflect the uncertainty analysis, not only the point estimates.
- Present results for all devices in the main text.
The manuscript focuses mainly on the smartwatch with the smallest error. Because the aim is to evaluate multiple smartwatches simultaneously, the main text should compare all devices. The authors should provide a table showing mean error, maximum error, RMSE, expanded uncertainty, and the number or proportion of test conditions within the tolerance range for each device.
- Ensure complete reporting for all five smartwatches.
The results for all five devices should be presented in the same format. If any device was excluded, failed, or had incomplete data, the reason should be clearly stated.
- Address the lack of reproducibility testing.
The study includes three short-term repetitions for each condition, but it does not assess within-day reproducibility, between-day reproducibility, reproducibility after remounting, or the effects of firmware, battery level, and operating conditions. These limitations should be clearly acknowledged.
- Revise the conclusions to match the evidence.
The study supports the feasibility of simultaneous bench-based evaluation of smartwatch accelerometers under controlled vibration conditions. It does not establish validity in patients, free-living settings, or clinical movement disorder assessment. The conclusions should be restricted to the methodological value of the bench-based approach and should state that further validation is required before clinical use.
Overall, this is a potentially useful methodological study, but the current manuscript overstates the implications of the data. Major revision is needed, particularly regarding the framing of the study, terminology, device reporting, completeness of results, interpretation of uncertainty, and claims about clinical application.
Author Response
This manuscript proposes a method for simultaneously evaluating accelerometers embedded in multiple consumer-grade smartwatches on the same vibration table. The topic is relevant, particularly for large-scale studies that use multiple wearable devices. However, the manuscript does not adequately distinguish between bench-based physical accuracy testing and clinical validity for movement disorder monitoring. The abstract and conclusions make strong claims about applications to Parkinson’s disease, movement disorders, and healthcare, but the study does not include patients, human wearing conditions, free-living settings, clinical symptoms, or disease severity. The manuscript should therefore be reframed as a methodological study of simultaneous bench-based evaluation of smartwatch accelerometers. The following major issues should be addressed.
- Clarify the study objective and align the conclusions with the data. The data is limited to controlled vibration-table experiments. However, the abstract and conclusions imply relevance to movement disorder monitoring and healthcare applications. The authors should define the main objective as the development or demonstration of a simultaneous bench-based evaluation method for smartwatch accelerometers. Any clinical implications should be stated cautiously as potential future applications.
Thank you for this observation. We would like to clarify that the primary objective of this study is the development and demonstration of a simultaneous bench-based evaluation methodology for smartwatch accelerometers, and not the clinical validation of any specific application. The clinical context is introduced solely to motivate the need for such a methodology, given that smartwatches are increasingly being deployed in movement disorder monitoring research. Any reference to clinical implications in the manuscript is intended to highlight the context and the potential future applicability of the proposed framework, not to claim direct clinical validity based on the present experimental data. We have carefully revised the document to clearly identify the main objective. Several changes already incorporated in the revised manuscript address this concern directly.
The Abstract (lines 8–10) has been updated to replace the previous formulation linking the test conditions specifically to Parkinson's disease with a broader framing:
“[…] covering motion characteristics relevant to a broad range of movement disorders. […]”
Furthermore, the last sentence of the Abstract has been revised to remove the specific reference to PD symptom monitoring and to frame the findings as a methodological contribution rather than a clinical validation.
The Conclusions (lines 580-582) have been revised to explicitly acknowledge the limitations of the study scope:
"[…] These findings should therefore be interpreted with caution, and the proposed methodology should be understood as a metrological characterization tool rather than a definitive validation of device accuracy for clinical applications. […]”
- Distinguish “calibration” from “accuracy assessment”. The manuscript should clarify whether the procedure is intended as comparison calibration, metrological characterization, or post hoc correction. If the term calibration is retained, the authors should specify that no device-specific correction equations were derived or applied, unless such correction procedures are added.
Thank you for this observation. We agree that it is important to clearly describe the scope and nature of the procedure presented in the manuscript. The methodology proposed in this study is intended to verify the correct functioning of the smartwatch accelerometers in terms of amplitude and frequency response, by comparison with a calibrated reference accelerometer under controlled vibration conditions. To avoid any ambiguity, we have added the following clarifying statement to Section 2 in lines 162-166:
"[…] The proposed methodology is intended to verify the amplitude and frequency response of each smartwatch accelerometer by comparison with a calibrated reference sensor under controlled vibration conditions. It should be noted that no correction equations based on the results of this verification have been derived or applied to the raw sensor output. […]"
- Provide a clearer rationale for device selection. The criteria for selecting the five Wear OS smartwatches are insufficiently described. The authors should report the selection criteria, exclusion criteria, operating system version, firmware version, sensor API, actual sampling rate, and missing data rate. Because smartwatch performance depends not only on the accelerometer but also on the operating system, firmware, application, and power management, these details are essential for assessing generalizability.
We thank the reviewer for this comment. We agree that additional methodological and technical details are necessary to improve reproducibility and allow a more accurate assessment of generalizability. In the revised manuscript, we have expanded the description of the device selection criteria and included detailed technical specifications for each smartwatch, including operating system version, firmware version, sensor API, sampling characteristics, and data completeness. Regarding data quality, we now explicitly report that the signal acquisition application used in this study had been previously characterized and demonstrated a limited data loss of approximately 2% of recorded samples. These additions have been incorporated into the revised manuscript as follows, in section 2.3, in lines 292-300 and 303-312:
“[…] Device selection was guided by predefined criteria, including: (i) commercial availability in 2024, (ii) integration of a triaxial accelerometer accessible through the standard Android Sensor API (SensorManager), (iii) broad market adoption, and (iv) stable support for continuous data acquisition during prolonged recording sessions. Devices were excluded if they lacked documented sensor APIs, exhibited unstable firmware behaviour, or did not allow access to raw accelerometer data streams.
All devices were running Wear OS version 4.x or higher, with firmware versions corresponding to the latest stable releases available at the time of data collection. […]”
“[…] A custom Wear OS application was deployed on each device to collect triaxial accelerometer signals via the Android SensorManager API. Data were recorded at a nominal sampling rate of 50 Hz, which provides sufficient temporal resolution for the frequency range of interest, in accordance with the Nyquist theorem. The maximum available sampling frequency of the devices is several orders of magnitude higher than the selected sampling rate, ensuring that the acquisition process operates well within hardware capabilities and is not constrained by sensor-level sampling limits. […]”
“[…] The signal acquisition application had been previously characterized and showed a data loss of approximately 2% of recorded samples. The captured signal was stored locally on each device for subsequent extraction and analysis […]”
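As an illustrative check of the sampling argument quoted above, the short Python sketch below compares the 50 Hz nominal sampling rate with the 8 Hz upper excitation frequency and estimates how many samples would remain after a loss of approximately 2%. The 60 s recording duration and all variable names are assumptions introduced for illustration only; they do not come from the manuscript.

```python
# Values taken from the response: 50 Hz nominal sampling rate, 1-8 Hz
# excitation band, approximately 2% of samples lost by the acquisition app.
SAMPLING_RATE_HZ = 50
MAX_EXCITATION_HZ = 8
DATA_LOSS_FRACTION = 0.02

# Hypothetical recording length, chosen only to make the arithmetic concrete.
RECORDING_SECONDS = 60

# The Nyquist frequency is half the sampling rate; the highest excitation
# frequency must lie below it to be represented without aliasing.
nyquist_hz = SAMPLING_RATE_HZ / 2
satisfies_nyquist = MAX_EXCITATION_HZ < nyquist_hz

# Number of samples available per period of the fastest excitation.
samples_per_cycle = SAMPLING_RATE_HZ / MAX_EXCITATION_HZ

# Sample budget for the hypothetical recording, before and after data loss.
expected_samples = SAMPLING_RATE_HZ * RECORDING_SECONDS
lost_samples = round(expected_samples * DATA_LOSS_FRACTION)
retained_samples = expected_samples - lost_samples

print(f"Nyquist frequency: {nyquist_hz} Hz (criterion met: {satisfies_nyquist})")
print(f"Samples per cycle at {MAX_EXCITATION_HZ} Hz: {samples_per_cycle}")
print(f"Retained samples: {retained_samples} of {expected_samples}")
```

Under these assumptions the highest excitation frequency is sampled with 6.25 points per cycle, which satisfies the Nyquist criterion but leaves limited margin for amplitude estimation from individual cycles.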
- Clarify whether only one unit per model was tested. The manuscript does not clearly state how many units were tested for each smartwatch model. If only one unit per model was tested, the results reflect the performance of specific individual devices, not the performance of each model as a whole. This limitation should be clearly stated. Ideally, the authors should test multiple units of the same model and quantify within-model variability.
Thank you for this important observation. We would like to clarify that the concept of within-model variability is not directly relevant in the context of the proposed methodology. Metrological characterization should be performed at the individual device level precisely because two units of the same model cannot be assumed to behave identically, due to manufacturing tolerances, firmware differences, or sensor-level variability. The proposed methodology is therefore designed to characterize each individual device independently, and its validity is demonstrated by the fact that five units with different hardware configurations were successfully validated simultaneously under identical conditions. Generalizing results to a model level would be contrary to the purpose of individual device characterization itself, which aims to establish the metrological traceability of each specific unit used in a study. Furthermore, we would like to note that the specific commercial models tested could potentially be identified from the technical specifications reported in Table 2, such as dimensions, weight, and accelerometer chip. However, the authors deliberately avoided model-level identification and comparison, as drawing conclusions about the performance of specific commercial brands falls outside the scope of this work and could introduce a commercial bias that is not relevant to the methodological contribution being presented. The manuscript has been updated to make this point more explicit, reinforcing the idea that two units of the same model cannot be assumed to be metrologically equivalent, and that individual device characterization is therefore an inherent requirement of any rigorous measurement validation procedure. The following text has been added to the Discussion section, in lines 528-535:
“[…] It should be noted that only one unit per smartwatch model was tested in this study. However, this does not constitute a limitation of the proposed methodology, since metrological characterization is inherently an individual-device procedure. Two units of the same model cannot be assumed to be metrologically equivalent, as manufacturing tolerances, firmware versions, and sensor-level variability may introduce differences in measurement performance between units. The proposed framework is precisely intended to address this issue by providing a tool to characterize each individual device before deployment in a research or clinical study, regardless of its model. […]”
- Verify the assumption of identical input conditions during simultaneous testing.
Placing multiple devices on the same vibration table does not necessarily ensure that each device receives the same acceleration input. Local vibration differences, phase differences, rotational components, device position, and mounting conditions may affect the results. The authors should evaluate and report the spatial uniformity of the vibration input and describe the device placement and mounting procedure in detail.
We thank the reviewer for this insightful comment. We agree that placing multiple devices on a common vibration platform does not inherently guarantee identical input conditions, and that spatial variability, phase differences, rotational components, device position, and mounting conditions may potentially affect the recorded signals. To address this concern, prior to measurements, a modal analysis of the vibration table used for simultaneous device placement was conducted to evaluate potential spatial non-uniformities and vibration-related artifacts. Although, based on the geometry, density, and material properties of the support, the fundamental vibration frequency was expected to lie above the frequency range of interest of this study, the analysis was performed to explicitly verify the absence of structural effects that could bias the measurements. The modal analysis showed that the first resonance frequency of the structure was 56 Hz. Given that the acquisition sampling rate was 50 Hz and that the analysis focuses on lower-frequency content, this result indicates that structural resonances do not interfere with the measured signals within the operational bandwidth. These results support the assumption of sufficiently uniform input conditions across devices during simultaneous testing.
We have modified the text to add this information in Section 2.2 (Experimental Protocol), in lines 232-247:
“[…] To ensure consistent input conditions across devices during simultaneous testing, all smartwatches were mounted on a common vibration platform under identical experimental conditions. A dedicated structure was carefully designed for this study, considering factors such as material stiffness, mass distribution, and geometric dimensions, as these parameters govern the dynamic behaviour and natural frequencies of the system. The objective was to ensure that the structure's resonant modes remained outside the frequency range of interest. A modal analysis of the supporting structure was conducted to evaluate potential spatial non-uniformities, local vibration effects, and structural resonances that could influence the recorded signals. Although, based on the geometry, density, rigidity, and material properties of the support, the fundamental resonance frequency was expected to lie above the frequency range of interest, the analysis confirmed that the first resonance frequency of the structure was 56 Hz. Considering that data were acquired at a sampling rate of 50 Hz and that the analysis focuses on lower-frequency content, structural resonances are not expected to affect the recorded measurements within the operational bandwidth. These results support the assumption of sufficiently uniform input conditions across devices during simultaneous testing. […]”
- Justify the selected frequency and acceleration ranges for movement disorder monitoring.
The manuscript states that the tested range of 1–8 Hz and 1–4 m/s² is relevant to Parkinson’s disease and movement disorders. However, the target symptom is not clearly specified. Tremor, gait, dyskinesia, freezing of gait, and upper-limb activity require different frequency ranges, acceleration ranges, and outcome metrics. The authors should explain which clinical application is being targeted and justify the selected test conditions accordingly.
Thank you for this observation. We would like to clarify that the proposed methodology is not intended to target any specific movement disorder symptom, but rather to provide a general simultaneous calibration framework applicable to large-scale movement disorder monitoring studies in which multiple smartwatch devices need to be validated concurrently under identical conditions.
The frequency range of 1–8 Hz and acceleration amplitudes of 1–4 m/s² are determined by the operational limits of the seismic table. Notably, this working range encompasses the typical frequency and amplitude intervals commonly associated with movement disorder manifestations, including tremor, gait alterations, dyskinesia, and freezing of gait. However, for application-specific clinical scenarios, the relevant amplitude and frequency ranges should be individually characterized, and dedicated experimental configurations should be designed accordingly. We have clarified this point in the revised manuscript by adding the following text to section 2.2, in lines 205-215:
“[…] The frequency range of 1–8 Hz and acceleration amplitudes of 1–4 m/s², defined by the operational limits of the seismic table, are well-suited to cover the motion characteristics typically associated with different movement disorders. Normal and abnormal gait patterns typically lie in the frequency range of 0.8–1.5 Hz, tremors are observed in the range of 3.5–7.5 Hz, and bradykinesia data generally occur between 1–6 Hz. Regarding amplitude, the movement of the arm during walking generates a signal with a maximum acceleration of approximately 4 m/s². The tested conditions therefore constitute a representative range for a general-purpose calibration protocol applicable to movement disorder monitoring. Researchers targeting a specific clinical application should nonetheless verify that the excitation parameters adequately cover the characteristic frequency and amplitude ranges of the disorder of interest. […]”
We also acknowledge that this mismatch was partly introduced by an inconsistency in the manuscript itself: the Abstract retained a previous formulation that explicitly linked the test conditions to Parkinson's disease tremor, reflecting an earlier version of the manuscript focused solely on PD. When the scope was broadened during an initial internal revision to encompass a more general calibration framework, the main text was updated accordingly, but the Abstract was inadvertently left in its earlier form. This has now been corrected in the revised manuscript. The Abstract (lines 7-10) now reads:
“[…] The proposed calibration system employs a seismic table to generate controlled vibrations within a frequency range of 1–8 Hz and acceleration amplitudes between 1 and 4 m/s², covering motion characteristics relevant to a broad range of movement disorders. […]”
- Address the fact that only 12 of the planned 15 test conditions were completed. The Methods describe 15 planned combinations of frequency and amplitude, but the Results report that three conditions could not be tested because of limitations of the vibration table. This is an important limitation of the proposed method. The authors should clearly distinguish planned and completed test conditions, provide a table of untested conditions, and discuss how the inability to test low-frequency/high-amplitude conditions affects the applicability of the method.
Thank you for this observation. We agree that the discrepancy between the planned and completed test conditions required clearer handling throughout the manuscript. The three conditions that could not be completed (1 Hz at 2.5 m/s², 1 Hz at 4 m/s², and 2 Hz at 4 m/s²) were excluded due to mechanical limitations of the seismic table, which produced inconsistent or non-linear excitation at low frequencies combined with high amplitudes. We recognize that this limitation was not sufficiently explained in the original manuscript, and that its implications for the clinical validity of the methodology, particularly in the context of movement disorder monitoring, deserved further discussion. Moreover, we acknowledge that the practical implementation of this calibration framework for specific motor disorders requires careful selection of an excitation device capable of reliably covering the full amplitude and frequency range characteristic of the target condition. To address all these points, the following changes have been made to the manuscript.
We have added the following text to the Methods section, in lines 216-220, supplemented by Table 1:
“[…] It should be noted that, due to mechanical limitations of the seismic table at low frequencies combined with high amplitudes, three of the fifteen planned conditions could not be completed: 1 Hz at 2.5 m/s², 1 Hz at 4 m/s², and 2 Hz at 4 m/s². At these conditions, the table produced inconsistent or non-linear excitation. The experimental validation was therefore carried out over 12 amplitude–frequency combinations, as detailed in Table 1. […]”
Also, we have added the following text in the Results section, in lines 401-403, and in the Discussion section, in lines 476-487:
“[…] As shown, 12 of the 15 planned amplitude–frequency combinations are presented, as the three conditions that could not be completed due to seismic table limitations were described in Section 2.2. […]”
“[…] The three excluded amplitude–frequency combinations (1 Hz at 2.5 m/s², 1 Hz at 4 m/s², and 2 Hz at 4 m/s²) correspond to low-frequency, high-amplitude conditions that may be clinically relevant in certain movement disorder contexts. While the proposed methodology is intended as a general simultaneous calibration framework, its practical implementation for specific motor disorders requires careful consideration of the characteristic frequency and amplitude ranges associated with each condition. When deploying this calibration approach for a particular clinical application, it is therefore recommended to select an excitation device capable of reliably covering the full amplitude and frequency range characteristic of the target disorder, avoiding the mechanical limitations observed here at low-frequency, high-amplitude combinations. Future work should explore alternative vibration platforms with extended linear operating ranges to ensure complete validation coverage across all clinically meaningful conditions. […]”
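To make the distinction between planned and completed conditions concrete, the following Python sketch enumerates a test grid and removes the three excluded points named above. The response gives only the ranges (1–8 Hz, 1–4 m/s²), the total of 15 planned conditions, and the three excluded combinations; the specific frequency levels and the 1 m/s² amplitude level used here are assumptions for illustration and may differ from Table 1 of the manuscript.

```python
from itertools import product

# Assumed levels: five frequencies and three amplitudes give 15 conditions.
# Only 2.5 and 4 m/s^2, and 1 and 2 Hz, are named explicitly in the response.
FREQUENCIES_HZ = [1, 2, 4, 6, 8]   # hypothetical levels within 1-8 Hz
AMPLITUDES_MS2 = [1.0, 2.5, 4.0]   # 1.0 is assumed from the lower range bound

# Conditions reported as not completed: (frequency in Hz, amplitude in m/s^2).
EXCLUDED = {(1, 2.5), (1, 4.0), (2, 4.0)}

# Full planned grid, then the subset that remains after exclusion.
planned = list(product(FREQUENCIES_HZ, AMPLITUDES_MS2))
completed = [condition for condition in planned if condition not in EXCLUDED]

print(f"Planned: {len(planned)}, completed: {len(completed)}")
for frequency_hz, amplitude_ms2 in sorted(EXCLUDED):
    print(f"Not completed: {frequency_hz} Hz at {amplitude_ms2} m/s^2")
```

Keeping the exclusions as an explicit set, rather than silently omitting rows from the grid, makes the gap in low-frequency, high-amplitude coverage visible in any downstream summary.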
- Do not generalize from the z-axis alone to the full triaxial accelerometer. The main analysis focuses on the z-axis. However, smartwatches are worn on the wrist, and the device coordinate system does not remain fixed relative to the body during daily activities. For movement disorder monitoring, all three axes, cross-axis sensitivity, and vector magnitude may be important. The authors should either provide results for all axes or clearly state that the study evaluates only one axis.
Thank you for your comment. We agree that the original manuscript did not sufficiently limit its claims to the evaluated axis, and that references to triaxial capabilities required either supporting evidence or more careful scoping. In response to this comment, we have taken the following actions.
First, the term 'triaxial' has been removed from the abstract, ensuring that the scope of the experimental validation is accurately represented from the outset. Second, the claims regarding the extension of the methodology to the remaining axes have been reframed throughout the manuscript as a methodological possibility rather than a validated result, making clear that the present study is restricted to the z-axis. The efficiency advantages of simultaneous calibration for triaxial characterization are discussed as a theoretical projection, not as an empirically demonstrated outcome.
To address this, the following text has been added to the Methods section, in lines 281-287:
“[…] The present analysis focuses on the z-axis, which received the primary excitation from the seismic table. Since most consumer-grade smartwatches incorporate triaxial accelerometers, the proposed methodology could be straightforwardly extended to the remaining axes, which would further amplify the time-efficiency advantages of simultaneous calibration, whereas traditional individual calibration would require 135 tests per device per full triaxial characterization, the proposed approach would maintain a fixed total test count regardless of the number of devices evaluated. […]”
The following clarification has been added to the Results section, in lines 387-389:
“[…] The experimental evaluation focuses on the z-axis, which received the direct excitation from the seismic table. […]”
Finally, the following text has been added to the Discussion section, in lines 520-527:
“[…] In this work, the analysis was restricted to the z-axis, which received the direct excitation from the seismic table. Although most consumer-grade smartwatches incorporate triaxial accelerometers, the calibration of additional axes would follow an identical procedure. In this case, the efficiency gains of simultaneous calibration would be even more pronounced, as the total number of tests remains constant regardless of the number of devices, making the approach particularly attractive for large-scale triaxial characterization. Because of this, future work should explore the effects and applicability of the proposed methodology for different axis. […]”
- Acknowledge that RMS error alone does not establish validity of clinical metrics.
The study evaluates RMS acceleration error, but movement disorder monitoring may rely on peak frequency, frequency-band power, gait periodicity, activity classification, tremor scores, and other digital biomarkers. A small RMS error does not necessarily mean that these downstream metrics are valid. The authors should clearly describe this limitation and avoid implying validity of clinical digital biomarkers.
The proposed methodology is intentionally scoped as a metrological characterization of the sensor itself, not a validation of any specific clinical biomarker. The RMS-based error and uncertainty quantification provides a foundational characterization of the sensor's measurement performance under controlled conditions, which is a necessary, though not sufficient, prerequisite for clinical validity. A more meaningful connection between sensor-level calibration and biomarker-level validity can be established through uncertainty propagation. If a clinical biomarker is defined as a mathematical function of the raw acceleration signal, the measurement uncertainty characterized here could be formally propagated to that biomarker using the law of propagation of variances, as defined in ISO/IEC Guide 98-3 (GUM), already adopted in this study. This would allow researchers to estimate the contribution of sensor uncertainty to the total uncertainty of a specific digital biomarker. To reinforce this idea, the following text has been added to the Discussion section, in lines 499-508:
“[…] It should be noted that this study evaluates sensor performance exclusively in terms of RMS acceleration error. Movement disorder monitoring typically relies on downstream digital biomarkers, such as peak frequency, frequency-band power, gait periodicity, or tremor scores, whose validity cannot be directly inferred from a low RMS error. The proposed calibration framework should therefore be understood as a necessary metrological foundation, rather than a validation of specific clinical metrics. For studies targeting a particular biomarker, the uncertainty characterization provided here could serve as a basis for formal uncertainty propagation from the sensor level to the biomarker level, following the ISO-GUM framework, allowing a principled estimation of how sensor measurement uncertainty contributes to the total uncertainty of the derived clinical metric. […]”
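For reference, the law of propagation of uncertainty from the GUM that the response invokes can be written as follows. The notation is an assumption introduced here: y = f(x_1, ..., x_N) stands for a biomarker computed from input quantities x_i (for example, acceleration-derived values), u(x_i) for their standard uncertainties, u(x_i, x_j) for their covariances, and k for the coverage factor used to obtain the expanded uncertainty U.

```latex
% Combined standard uncertainty of y = f(x_1, ..., x_N), first-order GUM form
u_c^2(y) = \sum_{i=1}^{N} \left( \frac{\partial f}{\partial x_i} \right)^{2} u^2(x_i)
         + 2 \sum_{i=1}^{N-1} \sum_{j=i+1}^{N}
           \frac{\partial f}{\partial x_i} \, \frac{\partial f}{\partial x_j} \, u(x_i, x_j),
\qquad
U = k \, u_c(y)
```

The covariance term vanishes when the inputs are uncorrelated. This first-order form is adequate only when f is approximately linear over the uncertainty range of its inputs, which would need to be checked for each biomarker.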
- Explain the clinical meaning of the 6% tolerance threshold.
The 6% threshold may be appropriate as an engineering standard, but its clinical relevance is not established. The authors should distinguish between engineering tolerance and clinically meaningful measurement error. They should also discuss how measurement error may affect symptom classification, longitudinal change, intervention effects, and between-group comparisons.
Thank you for this important comment. We fully agree that the relationship between the indication error and the expanded uncertainty requires careful clarification. In the original manuscript, the 6% threshold borrowed from ISO 8041-1:2017 was used as an indicative engineering reference in the absence of a dedicated normative framework for wearable accelerometers in clinical applications. We acknowledge that this threshold was not originally designed for consumer-grade wearable devices, and that the expanded uncertainty intervals exceeding this boundary at higher excitation amplitudes cannot be directly interpreted as device failure within a pass/fail judgment. Rather, this reflects the inherent variability of consumer-grade sensors and the limitations of applying an engineering standard outside its intended scope. We recognize that a rigorous acceptance rule would require a dedicated metrological standard for wearable accelerometers in healthcare, which does not currently exist. To address this point, we have clarified the interpretation of the uncertainty analysis throughout the manuscript and explicitly acknowledged this limitation. The following text has been added to the Results section, in lines 420-424 and in the Discussion Section, in lines 536-550:
“[…] To provide a reference framework for evaluating measurement error, the 6% tolerance threshold defined in ISO 8041-1:2017 for general-purpose vibration meters was adopted. Although this standard is not directly applicable to healthcare contexts, it was used here solely as a reference benchmark in the absence of a specific normative framework for wearable accelerometers in clinical applications. […]”
“[…] A broader limitation of the current evaluation concerns the absence of a dedicated normative framework for the metrological validation of wearable accelerometers in healthcare applications. The 6% tolerance threshold adopted in this study is borrowed from ISO 8041-1:2017, a standard designed for general-purpose vibration meters used in human vibration response assessment, which is not directly applicable to clinical or research wearable contexts. This threshold was therefore used solely as an indicative engineering reference to provide a basis for evaluating measurement error. The fact that expanded uncertainty intervals exceed this boundary at higher excitation amplitudes does not necessarily imply device failure but rather reflects the inherent variability of consumer-grade sensors and the limitations of applying an engineering standard outside its intended scope. The authors wish to emphasize the need to develop specific metrological standards for portable accelerometers in clinical and research settings, as this remains a significant unmet need in this field. Such standards would need to define not only tolerance thresholds appropriate for specific clinical applications, but also reproducibility requirements, wearing condition specifications, and uncertainty reporting guidelines tailored to healthcare contexts. […]”
- Interpret uncertainty intervals more cautiously. The manuscript reports that point estimates for the selected smartwatch were within the 6% range, but the uncertainty intervals exceeded this range. This substantially weakens the conclusion that the device is valid or closely aligned with the reference accelerometer. The conclusions should reflect the uncertainty analysis, not only the point estimates.
Thank you for this observation. We agree that conclusions should reflect the full uncertainty analysis rather than point estimates alone. The revised manuscript now explicitly acknowledges that, while point estimates generally fall within the 6% reference threshold, the expanded uncertainty intervals exceed this boundary, particularly at higher excitation amplitudes. This distinction is now clearly reflected in the conclusions, which have been revised to avoid implying unconditional device validity.
It should also be noted that this behaviour is consistent with the nature of consumer-grade sensors and with the application of an engineering threshold outside its intended scope, as discussed in the response to the previous comment. The uncertainty intervals are proportional to the excitation amplitude, so the absolute uncertainty grows at higher amplitudes; this is expected metrological behaviour and does not necessarily indicate poor device performance. To address this comment, the Conclusions section has been revised to reflect this more cautious interpretation, in lines 572-585:
“[…] This study presents a methodology for calibrating smartwatch accelerometers using a seismic table as the excitation source, under controlled laboratory conditions. The proposed calibration approach, based on a comparison with a reference accelerometer, enables the simultaneous evaluation of multiple smartwatch accelerometers under identical mechanical vibration conditions. The results demonstrate that consumer-grade smartwatches record acceleration measurements closely aligned with those of the reference accelerometer under bench-based testing, in terms of point estimates. However, the expanded uncertainty intervals exceed the 6% reference threshold adopted in this study, particularly at higher excitation amplitudes. These findings should therefore be interpreted with caution, and the proposed methodology should be understood as a metrological characterization tool rather than a definitive validation of device accuracy for clinical applications. […]”
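The distinction between a point estimate inside the 6% reference threshold and an expanded uncertainty interval that crosses it can be made explicit with a simple decision rule. The sketch below is one possible convention, in the spirit of guard-banded conformity assessment, and is not the acceptance rule used in the manuscript; the function names and all numeric values are hypothetical.

```python
def indication_error_pct(a_device, a_reference):
    """Percentage indication error of a device RMS value against the reference."""
    return 100.0 * (a_device - a_reference) / a_reference

def classify(error_pct, expanded_u_pct, limit_pct=6.0):
    """Three-way judgment based on the interval error +/- expanded uncertainty.

    'within'       : the whole interval lies inside +/- limit_pct
    'outside'      : the whole interval lies beyond the limit
    'inconclusive' : the interval crosses the limit
    """
    lower = error_pct - expanded_u_pct
    upper = error_pct + expanded_u_pct
    if lower >= -limit_pct and upper <= limit_pct:
        return "within"
    if lower > limit_pct or upper < -limit_pct:
        return "outside"
    return "inconclusive"

# Hypothetical test conditions: (error in %, expanded uncertainty in %)
for err, unc in [(2.0, 3.0), (4.0, 3.0), (10.0, 3.0)]:
    print(err, unc, classify(err, unc))
```

Under this convention, the second hypothetical condition has a point estimate inside the threshold but is reported as inconclusive, which is the situation the reviewer asks the conclusions to reflect.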
- Present results for all devices in the main text.
The manuscript focuses mainly on the smartwatch with the smallest error. Because the aim is to evaluate multiple smartwatches simultaneously, the main text should compare all devices. The authors should provide a table showing mean error, maximum error, RMSE, expanded uncertainty, and the number or proportion of test conditions within the tolerance range for each device.
Thank you for this suggestion. As described in the response to the next comment, the indication error and expanded uncertainty figures for all five smartwatches have been moved to the main Results section, accompanied by descriptive text for each device. Specifically, for each smartwatch, the revised text explicitly reports the range of percentage indication errors across all tested amplitude–frequency combinations, identifies which specific conditions fall outside the ±6% reference threshold, and discusses the observed frequency- and amplitude-dependent trends. For instance, the text describes how Smartwatch 3 showed the best overall performance with all test conditions within the valid range and errors between -4.2% and 4.8%, while Smartwatch 4 exhibited the largest deviations, with two conditions reaching errors of 8% and 18% at 1 m/s² excitation. For the remaining devices, the text details which frequency and amplitude combinations produced out-of-range values and discusses the underlying behaviour, such as the systematic overestimation observed in Smartwatches 1 and 2 or the absence of a clear trend in Smartwatch 5 at intermediate amplitudes. Furthermore, the expanded uncertainty intervals are discussed for each device in relation to the reference threshold, which is explicitly acknowledged as a limitation and addressed separately in the manuscript (comments 10 and 11). With these new figures and texts, we consider that the revised Results section provides a comprehensive and transparent characterization of the measurement performance across all tested devices. It covers the relevant error metrics and uncertainty ranges in a format that preserves the full frequency- and amplitude-dependent information, which would necessarily be lost if condensed into a summary table.
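For reference, the per-device quantities named in this exchange (mean error, maximum error, RMSE, and the number of conditions within tolerance) can be derived directly from the per-condition percentage errors. The following is a minimal sketch; the function name and the error values are hypothetical and do not reproduce any result from the manuscript.

```python
import math

def summarize(errors_pct, limit_pct=6.0):
    """Summary statistics for one device from its per-condition errors (in %)."""
    n = len(errors_pct)
    abs_err = [abs(e) for e in errors_pct]
    return {
        "mean_abs": sum(abs_err) / n,
        "max_abs": max(abs_err),
        "rmse": math.sqrt(sum(e * e for e in errors_pct) / n),
        "range": (min(errors_pct), max(errors_pct)),
        "n_within": sum(1 for a in abs_err if a <= limit_pct),
        "n_total": n,
    }

# Hypothetical percentage errors for one device over four test conditions
summary = summarize([3.0, -4.0, 8.0, 0.0])
print(summary)
```

Applying the same function to each device yields one table row per device, while the figures continue to carry the condition-level detail.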
- Ensure complete reporting for all five smartwatches. The results for all five devices should be presented in the same format. If any device was excluded, failed, or had incomplete data, the reason should be clearly stated.
Thank you for this comment. We fully agree that presenting the indication error and uncertainty analysis for only one smartwatch was insufficient to support the key claim of simultaneous multi-device calibration and could introduce a selection bias in the interpretation of the results. To address this, the figures showing the indication error and expanded uncertainty intervals for all five evaluated smartwatches (no devices were excluded), which were previously included as supplementary material, have been moved to the main Results section, accompanied by the corresponding descriptive text discussing the observed behaviour of each device. These additions provide a more complete and balanced characterization of the measurement performance across all tested devices, thereby strengthening the evidence supporting the proposed calibration methodology.
- Address the lack of reproducibility testing. The study includes three short-term repetitions for each condition, but it does not assess within-day reproducibility, between-day reproducibility, reproducibility after remounting, or the effects of firmware, battery level, and operating conditions. These limitations should be clearly acknowledged.
We thank the reviewer for highlighting the importance of reproducibility and long-term operational stability in wearable sensing applications. We agree that the present study focuses on short-term calibration repeatability under controlled laboratory conditions and does not include a comprehensive assessment of within-day reproducibility, between-day reproducibility, remounting effects, firmware updates, battery level variations, or other operational factors that may influence accelerometer measurements over extended use.
Regarding signal continuity, we agree that, beyond calibration accuracy, intermittent sensor anomalies and data discontinuities are highly relevant considerations in clinical movement monitoring applications, where missing or discontinuous data may affect the reliability of long-term assessments and symptom characterization. The application used for signal acquisition in this study is a registered Wear OS application developed by the authors. Previous characterization of the application confirmed that data loss is limited to approximately 2% of the total recorded data, ensuring an adequate level of signal continuity during the measurement sessions performed in this work.
Following the reviewer’s suggestion, we have incorporated an additional discussion paragraph, in lines 509-519, acknowledging these reproducibility limitations and emphasizing the need for future studies evaluating long-term reproducibility and operational stability under different usage conditions.
“[…] The present study evaluates short-term repeatability through three repetitions for each amplitude–frequency condition following ISO 8041-1:2017 recommendations. However, additional reproducibility aspects were not assessed, including within-day and between-day reproducibility, remounting variability, firmware-related effects, battery level influence, and other operational conditions that may affect accelerometer performance during prolonged real-world use. These factors may introduce additional variability in wearable sensing applications and should therefore be systematically evaluated in future work. Furthermore, although the signal acquisition application employed in this study had been previously characterized and showed data loss limited to approximately 2% of the recorded samples, intermittent signal discontinuities and sensor anomalies remain important considerations for long-term clinical monitoring applications. […]”
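The data-loss figure quoted above can be complemented by basic sampling diagnostics computed from the recorded timestamps. The sketch below is illustrative only: it assumes timestamps in seconds and a nominal 50 Hz rate, and the function name and the synthetic recording are hypothetical.

```python
import statistics

def sampling_report(timestamps_s, nominal_hz=50.0):
    """Expected vs observed sample counts, rate, jitter and largest gap."""
    intervals = [b - a for a, b in zip(timestamps_s, timestamps_s[1:])]
    duration = timestamps_s[-1] - timestamps_s[0]
    expected = round(duration * nominal_hz) + 1
    observed = len(timestamps_s)
    return {
        "expected": expected,
        "observed": observed,
        "loss_fraction": 1.0 - observed / expected,
        "mean_rate_hz": (observed - 1) / duration,
        "interval_sd_s": statistics.pstdev(intervals),
        "max_gap_s": max(intervals),
    }

# Synthetic 1 s recording at 50 Hz with two consecutive samples dropped
timestamps = [i * 0.02 for i in range(51) if i not in (10, 11)]
report = sampling_report(timestamps)
print(report)
```

Because samples lost inside the window lower the observed count but not the duration, the loss fraction and the largest gap are reported separately; losses at the very start or end of a recording would not be detected by this sketch.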
- Revise the conclusions to match the evidence. The study supports the feasibility of simultaneous bench-based evaluation of smartwatch accelerometers under controlled vibration conditions. It does not establish validity in patients, free-living settings, or clinical movement disorder assessment. The conclusions should be restricted to the methodological value of the bench-based approach and should state that further validation is required before clinical use.
We thank the reviewer for this important observation. We agree that the present study evaluates smartwatch accelerometers exclusively under controlled laboratory vibration conditions using a seismic test bench and therefore does not establish clinical validity in patients or free-living environments. Following the reviewer’s suggestion, a new paragraph has been incorporated into the Discussion section explicitly stating that further validation under real-world and clinical conditions is required before the proposed methodology can support clinical movement disorder assessment or healthcare deployment, in lines 551-562:
“[…] Although the proposed methodology demonstrates the feasibility of simultaneous smartwatch accelerometer calibration under controlled laboratory vibration conditions, the present study does not evaluate device performance in patients or free-living environments. Real-world movement disorder monitoring involves additional sources of variability, including sensor placement, motion artifacts, daily activity patterns, patient-specific movement characteristics, and potential signal discontinuities, which are not reproduced in the current bench-based setup. In this context, beyond measurement accuracy, signal continuity is also a critical factor in clinical movement monitoring, as intermittent data loss may compromise the reliability of long-term assessments. Therefore, further validation studies involving clinical populations and ecological monitoring conditions are required before the proposed methodology can support clinical assessment or healthcare decision-making applications. […]”
Overall, this is a potentially useful methodological study, but the current manuscript overstates the implications of the data. Major revision is needed, particularly regarding the framing of the study, terminology, device reporting, completeness of results, interpretation of uncertainty, and claims about clinical application.
Round 2
Reviewer 3 Report
Comments and Suggestions for Authors
I appreciate the authors’ efforts to revise the manuscript and to respond to the previous comments. The revised version is clearer than the original manuscript, particularly in acknowledging that the study is a bench-based metrological evaluation rather than a clinical validation study. The additional discussion on RMS error, uncertainty, incomplete test conditions, and reproducibility limitations is also helpful.
However, several important concerns remain. The manuscript still tends to overstate the clinical relevance of the findings, particularly in the title, abstract and conclusion. The terminology around “calibration” remains potentially misleading because no device-specific correction equations were derived or applied. In addition, the manuscript still does not adequately address the implications of testing only one unit per smartwatch model, the assumption of identical input conditions across device positions, the lack of a concise summary table for all devices, and inconsistencies between the main manuscript and supplementary material. These issues should be resolved before the manuscript can be considered further.
Major comments
- The title, abstract, and conclusion should be further revised to avoid overstating clinical applicability. Although the revised manuscript now acknowledges that this is a bench-based metrological characterization study, the title still implies accurate measurement in movement disorder monitoring. The abstract also refers to deployment in “clinical studies involving movement disorder monitoring.” This wording remains too strong because the study did not include patients, human wearing conditions, free-living measurements, clinical symptoms, or disease severity. The title and abstract should be aligned with the actual scope of the study. A more appropriate title would refer to “bench-based metrological characterization” or “simultaneous bench-based evaluation” of smartwatch accelerometers.
- The use of the term “calibration” remains insufficiently justified. The authors have clarified that no correction equations were derived or applied to the raw sensor outputs. This clarification is useful. However, the manuscript still uses “calibration” as the central term in the title, keywords, abstract, and throughout the text. If no post hoc correction, calibration coefficient, or calibrated output is provided, the procedure is better described as “metrological characterization,” “accuracy assessment,” or “comparison-based evaluation.” If the authors retain the term “calibration,” they should explicitly state in the title or abstract that this is a comparison-based verification procedure and not a correction-based calibration method.
- The limitation of testing only one unit per model should be acknowledged more appropriately. The authors state that testing one unit per model is not a limitation because metrological characterization is inherently an individual-device procedure. This response is only partly acceptable. Individual-device characterization is indeed important, but the manuscript still refers to “smartwatch models,” and readers may interpret the findings as model-level comparisons. With only one unit per model, the study cannot estimate within-model variability and cannot generalize the observed performance to each commercial model. The authors should explicitly acknowledge this as a limitation and state that the findings apply only to the specific devices tested, not to the performance of each model as a whole.
- The assumption of identical input conditions across devices requires stronger support. The authors added a modal analysis showing that the first resonance frequency of the supporting structure was 56 Hz. This is useful, but it does not directly demonstrate that all smartwatch positions received the same acceleration input. The key concern is spatial uniformity of the vibration input across the platform, including local amplitude differences, phase differences, rotational components, and position-dependent effects. The authors should provide direct evidence of spatial uniformity, such as measurements with the reference accelerometer at multiple device positions or a position-rotation test. If such data are unavailable, the manuscript should clearly state that spatial uniformity was inferred but not directly verified.
- The discussion of the 56 Hz resonance frequency should be revised. The current explanation links the 56 Hz resonance frequency with the 50 Hz sampling rate and the lower-frequency analysis band. This could be confusing because 56 Hz is above the Nyquist frequency for a 50 Hz sampling rate. The authors should avoid implying that the sampling rate itself supports the absence of resonance effects. The relevant argument should be that the resonance frequency lies well above the 1–8 Hz excitation range. Any potential aliasing or higher-frequency structural vibration should also be considered or explicitly ruled out.
- The manuscript should include a concise summary table for all devices. The authors moved the device-specific figures into the main Results section, which improves transparency. However, this does not replace the need for a summary table. A table can complement the figures without losing frequency- and amplitude-specific information. The authors should provide, for each device, at least the mean absolute error, maximum absolute error, RMSE, range of percentage error, mean or range of expanded uncertainty, and the number or proportion of completed test conditions within the ±6% threshold. This table would substantially improve readability and allow readers to compare devices efficiently.
- The supplementary material should be updated or removed for consistency. The revised response states that the figures for all five smartwatches were moved to the main manuscript. However, the submitted supplementary material appears to retain the previous structure and does not clearly report all five devices in a consistent format. In particular, Smartwatch 3 is not clearly presented in the supplementary file. The authors should either update the supplementary material to match the revised manuscript or remove it if the relevant figures have been fully transferred to the main text.
- The manuscript should provide more detailed information on sampling performance and data loss. The authors now state that the data acquisition application had approximately 2% data loss. This is useful, but insufficient. The authors should report whether this data loss varied by device or test condition. They should also provide the actual sampling rate, sampling interval variability, expected and observed sample counts, and the maximum gap in recording. These details are important because irregular sampling and data loss can affect frequency-domain analyses and RMS estimates.
- The z-axis limitation should be stated more conservatively. The manuscript now clarifies that only the z-axis was evaluated. This is appropriate. However, the authors still state that the method could be straightforwardly extended to the remaining axes and that calibration of additional axes would follow an identical procedure. This may be too strong because mounting, gravity, cross-axis sensitivity, and device orientation may differ across axes. The authors should state more cautiously that extension to other axes is plausible but remains to be empirically verified.
- The clinical relevance of the selected frequency and amplitude ranges should be presented more cautiously. The added explanation on gait, tremor, bradykinesia, and arm movement is helpful. However, the phrase “broad range of movement disorders” remains broad. The tested range may not fully cover all clinically relevant movement patterns, especially low-frequency/high-amplitude movements, dyskinesia, freezing of gait, transitions, or non-sinusoidal free-living movements. The authors should frame the selected conditions as a pragmatic bench-test range rather than a comprehensive movement-disorder monitoring range.
- The clinical implications of measurement error should be discussed more concretely. The manuscript now acknowledges that the 6% threshold is an engineering reference rather than a clinically established threshold. However, the implications for clinical and epidemiological studies remain underdeveloped. The authors should discuss how device-related measurement error may affect exposure or outcome classification, longitudinal change estimates, intervention effects, between-group comparisons, and device replacement in longitudinal studies. This would strengthen the relevance of the paper for large-scale research.
- The conclusion should be further restricted to the evidence generated in this study. The conclusion should state that the study demonstrates the feasibility of simultaneous bench-based characterization of selected smartwatch accelerometers under controlled vibration conditions. It should not imply that the method ensures clinical accuracy, suitability for patient monitoring, or validity of movement-disorder digital biomarkers. Any statement about clinical or epidemiological use should be framed as requiring further validation in human wearing conditions, free-living environments, and relevant patient populations.
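To illustrate the aliasing concern raised in the comment on the 56 Hz resonance: for an ideal sampler without anti-aliasing filtering, a component above the Nyquist frequency folds back into the baseband. The sketch below shows the standard folding calculation; the function name is hypothetical, and whether the tested smartwatches apply internal filtering before delivering 50 Hz data is not stated in the material reviewed here, so the result is a worst-case illustration rather than a finding.

```python
def aliased_frequency(f_hz, fs_hz):
    """Apparent frequency of a tone at f_hz after ideal sampling at fs_hz."""
    return abs(f_hz - fs_hz * round(f_hz / fs_hz))

fs = 50.0          # sampling rate reported for the smartwatches
nyquist = fs / 2   # 25 Hz
f_res = 56.0       # first resonance of the supporting structure

f_apparent = aliased_frequency(f_res, fs)
print(f"Nyquist: {nyquist} Hz, 56 Hz appears at {f_apparent} Hz")
```

Under these assumptions a 56 Hz vibration would appear at 6 Hz, inside the 1–8 Hz excitation range, which is why the absence of resonance excitation should be argued from the excitation band and, ideally, from the reference accelerometer spectrum rather than from the sampling rate.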
Minor comments
- Please revise the keyword “Calibration” if the main terminology is changed to “metrological characterization” or “bench-based evaluation.”
- Please ensure that “smartwatch models” and “smartwatch units” are used consistently. If only one unit per model was tested, the manuscript should avoid language implying model-level inference.
- Please check whether all references to 15 test conditions have been updated to distinguish planned from completed conditions.
- Please improve grammatical issues that remain in the revised manuscript, including phrases such as “The reminder of the paper” and “for an example of use.”
- Please ensure that all figures use clear axis labels, units, and captions. The captions should define the meaning of error bars, shaded regions, and the ±6% threshold.
Author Response
Major comments
- The title, abstract, and conclusion should be further revised to avoid overstating clinical applicability. Although the revised manuscript now acknowledges that this is a bench-based metrological characterization study, the title still implies accurate measurement in movement disorder monitoring. The abstract also refers to deployment in “clinical studies involving movement disorder monitoring.” This wording remains too strong because the study did not include patients, human wearing conditions, free-living measurements, clinical symptoms, or disease severity. The title and abstract should be aligned with the actual scope of the study. A more appropriate title would refer to “bench-based metrological characterization” or “simultaneous bench-based evaluation” of smartwatch accelerometers.
Thank you for this valuable comment. We agree that the title, abstract, and conclusion required further revision to accurately reflect the scope of the study. The title has been updated to explicitly frame the study as a bench-based metrological characterization:
"[…] Simultaneous bench-based metrological characterization of Smartwatches' Accelerometers for Accurate Measurement […]"
The abstract has been revised to avoid overstating clinical applicability and to align with the actual scope of the study. In particular, the reference to "clinical studies involving movement disorder monitoring" has been reframed to emphasize the bench-based and methodological nature of the contribution, and any implication of direct clinical validity has been removed. The revised abstract now reads:
"[…] Accelerometers embedded in consumer-grade smartwatches hold significant potential for research applications like health applications, but their measurement reliability is often compromised. This limitation necessitates proper characterization to ensure precision and consistency, particularly in healthcare, where accurate data is critical for patient monitoring and clinical decision-making. This study proposes a methodology for the simultaneous metrological characterization of multiple smartwatch accelerometers, enabling efficient and consistent measurement validation. The proposed methodology employs a seismic table to generate controlled vibrations within a frequency range of 1–8 Hz and acceleration amplitudes between 1 and 4 m/s². Five commercial smartwatch units were tested, collecting acceleration data at sampling rate of 50 Hz. A reference accelerometer was used to assess the accuracy of smartwatch measurements, with errors and uncertainties quantified following ISO standards. Results demonstrate that simultaneous bench-based evaluation ensures measurement consistency across devices while reducing the time required for the process. The analysis highlights variations in frequency response and amplitude accuracy across different smartwatch units, emphasizing the need for systematic metrological characterization when considering future deployment of smartwatches in large-scale research or clinical studies involving movement disorder monitoring. […]"
Furthermore, the Conclusions section has been revised in the same direction to ensure full consistency with the framing established in the title and abstract. Any statement implying direct clinical validity, suitability for patient monitoring, or validity of movement disorder digital biomarkers has been reframed as requiring further validation under human wearing conditions, free-living environments, and relevant patient populations, as detailed in the response to comment 12.
- The use of the term “calibration” remains insufficiently justified. The authors have clarified that no correction equations were derived or applied to the raw sensor outputs. This clarification is useful. However, the manuscript still uses “calibration” as the central term in the title, keywords, abstract, and throughout the text. If no post hoc correction, calibration coefficient, or calibrated output is provided, the procedure is better described as “metrological characterization,” “accuracy assessment,” or “comparison-based evaluation.” If the authors retain the term “calibration,” they should explicitly state in the title or abstract that this is a comparison-based verification procedure and not a correction-based calibration method.
Thank you for this observation. We agree that the term "calibration" requires careful justification in the context of this study, given that no correction equations or calibration coefficients were derived or applied to the raw sensor outputs. We have reviewed the manuscript and replaced the term "calibration" with "metrological characterization" wherever it referred to the procedure proposed in this study. Exceptions have been retained where the term "calibration" appears in the titles or descriptions of cited references, including ISO standards and previously published works, in which case the original terminology used by the respective authors has been preserved. These changes have been applied throughout the title, keywords, abstract, and main text, ensuring that the scope and nature of the proposed procedure are accurately represented from the outset.
- The limitation of testing only one unit per model should be acknowledged more appropriately. The authors state that testing one unit per model is not a limitation because metrological characterization is inherently an individual-device procedure. This response is only partly acceptable. Individual-device characterization is indeed important, but the manuscript still refers to “smartwatch models,” and readers may interpret the findings as model-level comparisons. With only one unit per model, the study cannot estimate within-model variability and cannot generalize the observed performance to each commercial model. The authors should explicitly acknowledge this as a limitation and state that the findings apply only to the specific devices tested, not to the performance of each model as a whole.
Thank you for this suggestion. We agree that the original response was only partly adequate and that the limitation regarding generalizability required more explicit acknowledgement. In response to this comment, all remaining references to "smartwatch models" have been replaced by "smartwatch units" throughout the manuscript, ensuring that the text accurately reflects that the findings pertain to specific individual devices rather than to commercial models as a whole. Furthermore, the following text has been added to the Discussion section to explicitly acknowledge this limitation, in lines 560-566:
"[…] Nevertheless, it should be acknowledged that testing only one unit per model constitutes a limitation of the present study in terms of generalizability. The findings reported in this document apply exclusively to the specific devices tested and cannot be extrapolated to the performance of each commercial model as a whole. In particular, within-model variability cannot be estimated from the present dataset, and future studies should include multiple units of the same model to assess the extent of inter-unit differences in accelerometer performance. […]"
- The assumption of identical input conditions across devices requires stronger support. The authors added a modal analysis showing that the first resonance frequency of the supporting structure was 56 Hz. This is useful, but it does not directly demonstrate that all smartwatch positions received the same acceleration input. The key concern is spatial uniformity of the vibration input across the platform, including local amplitude differences, phase differences, rotational components, and position-dependent effects. The authors should provide direct evidence of spatial uniformity, such as measurements with the reference accelerometer at multiple device positions or a position-rotation test. If such data are unavailable, the manuscript should clearly state that spatial uniformity was inferred but not directly verified.
Thank you for this suggestion. We agree that the modal analysis alone does not constitute direct evidence of spatial uniformity of the vibration input across all device positions, and that the original response did not sufficiently address this concern. In response to this comment, we explicitly acknowledge in the Discussion section that spatial uniformity was inferred from the structural design and modal analysis results rather than empirically demonstrated through multi-position reference measurements. The following text has been added to the Discussion section, in lines 567-579:
"[…] An additional limitation concerns the assumption of spatial uniformity of the vibration input across all device positions on the platform. Although the modal analysis confirmed that the first resonance frequency of the supporting structure was 56 Hz, well above the operational bandwidth of this study, this does not constitute direct evidence that all smartwatch positions received identical acceleration inputs. Local amplitude differences, phase differences, rotational components, and position-dependent effects were not directly verified through multi-position reference measurements. Spatial uniformity was therefore inferred from the structural design and modal analysis results rather than empirically demonstrated. Future implementations of this simultaneous metrological characterization methodology should include direct verification of input uniformity, such as reference accelerometer measurements at multiple positions across the platform, to provide stronger experimental support for this assumption and further consolidate the validity of the concurrent evaluation approach. […]"
- The discussion of the 56 Hz resonance frequency should be revised. The current explanation links the 56 Hz resonance frequency with the 50 Hz sampling rate and the lower-frequency analysis band. This could be confusing because 56 Hz is above the Nyquist frequency for a 50 Hz sampling rate. The authors should avoid implying that the sampling rate itself supports the absence of resonance effects. The relevant argument should be that the resonance frequency lies well above the 1–8 Hz excitation range. Any potential aliasing or higher-frequency structural vibration should also be considered or explicitly ruled out.
Thank you for this observation. We acknowledge that the original explanation was potentially misleading: it linked the 56 Hz resonance frequency with the 50 Hz sampling rate, which is incorrect because 56 Hz lies above the Nyquist frequency for a 50 Hz sampling rate. The argument now rests on the separation between the resonance frequency and the 1–8 Hz excitation range. The manuscript has been revised accordingly, and the following updated text has been incorporated into Section 2.2, in lines 249-255:
"[…] This value lies well above the 1–8 Hz excitation range used in this study, indicating that structural resonances are not expected to influence the recorded measurements within the operational bandwidth. Furthermore, given that the excitation is confined to the 1–8 Hz range and the bandpass filtering applied during data processing retains only the frequency content within each corresponding one-third octave band, any potential contribution from higher-frequency structural vibration to the analyzed signals is effectively excluded. […]"
- The manuscript should include a concise summary table for all devices. The authors moved the device-specific figures into the main Results section, which improves transparency. However, this does not replace the need for a summary table. A table can complement the figures without losing frequency- and amplitude-specific information. The authors should provide, for each device, at least the mean absolute error, maximum absolute error, RMSE, range of percentage error, mean or range of expanded uncertainty, and the number or proportion of completed test conditions within the ±6% threshold. This table would substantially improve readability and allow readers to compare devices efficiently.
Thank you for your comment. We agree that a summary table would substantially improve readability and facilitate cross-device comparison. A concise summary table (Table 4) has been added to the Results section, immediately after the device-specific figures, reporting for each smartwatch unit the mean absolute error, maximum absolute error, RMSE, percentage error range, mean expanded uncertainty, and the proportion of completed test conditions satisfying the ±6% criterion. The following introductory sentence has been added before the table:
"[…] Table 4 summarizes the metrological characterization results for all five smartwatch units, providing a concise comparison across devices in terms of mean absolute error, maximum absolute error, RMSE, percentage error range, mean expanded uncertainty, and the proportion of test conditions satisfying the ±6% criterion. This overview complements the device-specific figures presented above by enabling a direct cross-device comparison of measurement performance. […]"
Additionally, the descriptive text associated with Figures 6–10 has been carefully reviewed and corrected, as some inaccuracies were identified in the written descriptions. The figures themselves remain unchanged.
- The supplementary material should be updated or removed for consistency. The revised response states that the figures for all five smartwatches were moved to the main manuscript. However, the submitted supplementary material appears to retain the previous structure and does not clearly report all five devices in a consistent format. In particular, Smartwatch 3 is not clearly presented in the supplementary file. The authors should either update the supplementary material to match the revised manuscript or remove it if the relevant figures have been fully transferred to the main text.
Thank you for this observation. We confirm that all figures for the five evaluated smartwatch units have been moved to the main Results section of the revised manuscript, and that no content is intended to remain in supplementary material. The supplementary file was inadvertently retained in the submission platform and could not be removed directly through the editorial system. We have therefore contacted the editorial office to request its removal, as it no longer forms part of the revised manuscript. The relevant figures and their corresponding descriptive text are now fully integrated into the main text in a consistent format for all five devices.
- The manuscript should provide more detailed information on sampling performance and data loss. The authors now state that the data acquisition application had approximately 2% data loss. This is useful, but insufficient. The authors should report whether this data loss varied by device or test condition. They should also provide the actual sampling rate, sampling interval variability, expected and observed sample counts, and the maximum gap in recording. These details are important because irregular sampling and data loss can affect frequency-domain analyses and RMS estimates.
Thank you for this suggestion. We agree that more detailed information on sampling performance and data loss was needed. The requested metrics were computed directly from the experimental sessions reported in this study. It should be noted, however, that expected and observed sample counts cannot be computed precisely for the individual test sessions, as no exact temporal reference for sampling rate computation was recorded during the bench tests. Nevertheless, the 2% data loss obtained from the extended characterization sessions, which are also described in the document, provides a representative estimate of the application's data completeness under normal operating conditions. The following text has been added to Section 2.3, in lines 318-324:
"[…] The actual mean sampling rate was 50 Hz, corresponding to one sample every 20 milliseconds, with a sampling interval variability of ±3.8 milliseconds, with a maximum inter-sample gap of 35 milliseconds. The signal acquisition application had been previously characterized during recordings exceeding 48 hours, by comparing expected and observed sample counts, and showed a data loss of approximately 2% of recorded samples. These sampling characteristics were consistent across devices and test conditions. […]"
- The z-axis limitation should be stated more conservatively. The manuscript now clarifies that only the z-axis was evaluated. This is appropriate. However, the authors still state that the method could be straightforwardly extended to the remaining axes and that calibration of additional axes would follow an identical procedure. This may be too strong because mounting, gravity, cross-axis sensitivity, and device orientation may differ across axes. The authors should state more cautiously that extension to other axes is plausible but remains to be empirically verified.
Thank you for this comment. We agree that the original formulation was too strong and that the extension of the methodology to the remaining axes cannot be assumed to follow an identical procedure without empirical verification. The manuscript has been revised accordingly, and the following updated text has been incorporated into the Discussion section, in lines 543-552:
"[…] The extension of the proposed methodology to the remaining axes is plausible but remains to be empirically verified. Factors such as device mounting orientation, gravitational component distribution, and cross-axis sensitivity may differ across axes and could influence the characterization results in ways that cannot be directly inferred from the present z-axis evaluation. If this proposal is successfully extended, the efficiency gains of simultaneous characterization would be even more pronounced, as the total number of tests remains constant regardless of the number of devices, making the approach particularly attractive for large-scale triaxial characterization. Future work should therefore empirically evaluate the applicability of the proposed methodology to the remaining accelerometer axes. […]"
- The clinical relevance of the selected frequency and amplitude ranges should be presented more cautiously. The added explanation on gait, tremor, bradykinesia, and arm movement is helpful. However, the phrase “broad range of movement disorders” remains broad. The tested range may not fully cover all clinically relevant movement patterns, especially low-frequency/high-amplitude movements, dyskinesia, freezing of gait, transitions, or non-sinusoidal free-living movements. The authors should frame the selected conditions as a pragmatic bench-test range rather than a comprehensive movement-disorder monitoring range.
Thank you for this observation. We agree that the phrase "broad range of movement disorders" was overly broad and that the tested conditions should be framed more cautiously as a pragmatic bench-test range rather than a comprehensive coverage of all clinically relevant movement patterns. The manuscript has been revised accordingly throughout, replacing overly broad formulations with more conservative language. Furthermore, the following text has been added to the Discussion section, in lines 642-657, to explicitly address the limitations of sinusoidal steady-state excitation in capturing the full complexity of motor symptoms:
"[…] Additionally, it should be pointed out that motor symptoms in movement disorders are not only characterized by the energy contained in each frequency but also by highly variable peak-to-peak amplitudes, transient events, and rapidly evolving signal morphologies, such as those observed in dyskinesia episodes or in signals with fast-onset characteristics. These features cannot be adequately captured by sinusoidal steady-state excitation at fixed amplitude and frequency combinations. Addressing this limitation would require the development of dedicated simultaneous evaluation protocols capable of reproducing more complex, non-stationary excitation profiles under controlled conditions. Future work should therefore explore excitation waveforms beyond pure sinusoidal signals, including transient, swept-frequency, or clinically representative recorded signals, as well as the applicability of complementary ISO standards that address dynamic measurement beyond steady-state sinusoidal conditions, such as those covering shock calibration (ISO 16063-13) or transient motion (ISO 16063-15), which may provide a more complete metrological framework for the evaluation of wearable accelerometers in complex motor symptom monitoring scenarios. […]"
- The clinical implications of measurement error should be discussed more concretely. The manuscript now acknowledges that the 6% threshold is an engineering reference rather than a clinically established threshold. However, the implications for clinical and epidemiological studies remain underdeveloped. The authors should discuss how device-related measurement error may affect exposure or outcome classification, longitudinal change estimates, intervention effects, between-group comparisons, and device replacement in longitudinal studies. This would strengthen the relevance of the paper for large-scale research.
Thank you for this comment. We agree that the clinical implications of device-related measurement error were insufficiently developed in the previous version of the manuscript. In response to this comment, two changes have been made. First, the description of the selected frequency and amplitude ranges in Section 2.2, in lines 205-221, has been revised to frame the tested conditions more cautiously as a pragmatic bench-test range rather than a comprehensive movement-disorder monitoring range:
"[…] The frequency range of 1–8 Hz and acceleration amplitudes of 1–4 m/s², determined by the operational limits of the seismic table, constitute a pragmatic bench-test range that partially overlaps with motion characteristics reported for some movement disorders. For reference, normal and abnormal gait patterns typically lie in the frequency range of 0.8–1.5 Hz, tremors are observed in the range of 3.5–7.5 Hz, and bradykinesia data generally occur between 1–6 Hz. Regarding amplitude, the movement of the arm during walking generates a signal with a maximum acceleration of approximately 4 m/s². However, this range should not be interpreted as a comprehensive coverage of all clinically relevant movement patterns. Other motor phenomena, such as dyskinesia, freezing of gait, postural transitions, or non-sinusoidal free-living movements, may involve frequency and amplitude characteristics that fall outside or only partially within the tested range. Furthermore, the use of sinusoidal excitation in a controlled laboratory setting does not replicate the complexity and variability of real-world movement signals. The tested conditions should therefore be understood as a pragmatic starting point for a general-purpose simultaneous characterization protocol, and researchers targeting a specific clinical application should carefully verify that the excitation parameters adequately cover the characteristic frequency and amplitude ranges of the disorder of interest. […]"
Second, the following text has been added to the Discussion section, in lines 616-641, to address the concrete clinical implications of measurement error:
"[…] The clinical implications of device-related measurement error deserve further consideration, particularly in the context of large-scale movement disorder research. Over recent years, there has been a growing trend in the scientific community toward quantifying motor symptoms through instrumental methods using wearable devices, with the aim of providing more objective and continuous assessments than traditional clinical evaluations. In this context, RMS acceleration is one of the most commonly used metrics for characterizing tremor amplitude and consistency. Established clinical scales such as the Unified Parkinson's Disease Rating Scale (UPDRS) have been correlated with RMS-based acceleration features derived from wearable sensors, meaning that measurement uncertainty at the sensor level may propagate into the assignment of clinical scores. For instance, uncertainty in the RMS estimate obtained from a given device could affect the estimation of derived digital biomarkers and, in studies where such biomarkers are related to clinical scales such as the UPDRS, may influence the interpretation of category-level changes, longitudinal progression, or intervention-related effects. Furthermore, in longitudinal studies where the objective is to detect change over time between an initial assessment and a follow-up measurement, the dispersion of the recorded signal has two components: one associated with the actual biological variability of the patient, and one associated with the instrumentation itself. If the measurement uncertainty is not properly characterized and accounted for, it may mask true clinical change or, conversely, be misinterpreted as meaningful progression. Similar considerations apply to between-group comparisons, where systematic differences in device performance across units could introduce bias, and to device replacement in longitudinal studies, where metrological continuity between units must be ensured. 
The proposed characterization framework directly addresses this need by providing a systematic tool to quantify and document the measurement uncertainty of each individual device prior to deployment, thereby allowing researchers to estimate the contribution of instrumental uncertainty to the total variability of derived clinical metrics. […]"
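The two-component view of dispersion described in the quoted text can be made concrete with a short sketch. It assumes that biological and instrumental variability are independent, so that their variances add; this independence is an assumption introduced here and is not stated in the manuscript. The function names and the numbers in the example are hypothetical.

```python
import math


def combined_sd(sd_biological: float, sd_instrument: float) -> float:
    """Total standard deviation, assuming independent components."""
    return math.hypot(sd_biological, sd_instrument)


def instrument_share(sd_biological: float, sd_instrument: float) -> float:
    """Fraction of the total variance attributable to the instrument."""
    return sd_instrument**2 / (sd_biological**2 + sd_instrument**2)


# Hypothetical standard deviations in m/s^2 (illustrative only).
total_sd = combined_sd(0.3, 0.4)
share = instrument_share(0.3, 0.4)
print(total_sd, share)
```

With a characterized instrumental standard deviation for each unit, a researcher could estimate how much of an observed change between baseline and follow-up may be attributable to the device rather than to the patient.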
- The conclusion should be further restricted to the evidence generated in this study. The conclusion should state that the study demonstrates the feasibility of simultaneous bench-based characterization of selected smartwatch accelerometers under controlled vibration conditions. It should not imply that the method ensures clinical accuracy, suitability for patient monitoring, or validity of movement-disorder digital biomarkers. Any statement about clinical or epidemiological use should be framed as requiring further validation in human wearing conditions, free-living environments, and relevant patient populations.
Thank you for this suggestion. We agree that the conclusions required further restriction to the evidence generated in this study, and that any reference to clinical or epidemiological use should be framed as requiring further validation. The following changes have been made to the Conclusions section.
First, the description of the methodology as a screening tool has been revised to avoid implying clinical validity:
"[…] Furthermore, it may serve as an initial metrological screening tool to document the measurement performance of individual smartwatch units before their use in research studies. However, the present study evaluates only selected individual devices under bench-based conditions, and these findings cannot be directly extrapolated to clinical accuracy, suitability for patient monitoring, or validity of movement disorder digital biomarkers. […]"
Second, the following sentence has been added at the end of the Conclusions section to explicitly frame any clinical application as requiring further validation:
"[…] Any application of this methodology toward clinical movement disorder monitoring or the validation of digital biomarkers will require further validation under human wearing conditions, free-living environments, and relevant patient populations. […]"
These changes ensure that the conclusions are fully consistent with the scope of the evidence generated in this study, which is restricted to controlled bench-based characterization of selected individual smartwatch units under sinusoidal vibration conditions.
Minor comments
- Please revise the keyword “Calibration” if the main terminology is changed to “metrological characterization” or “bench-based evaluation.”
The keyword "Calibration" has been replaced with "Metrological Characterization" in the keyword list to ensure consistency with the revised terminology used throughout the manuscript.
- Please ensure that “smartwatch models” and “smartwatch units” are used consistently. If only one unit per model was tested, the manuscript should avoid language implying model-level inference.
The manuscript has been carefully reviewed and all remaining instances of "smartwatch models" have been replaced by "smartwatch units" to avoid any implication of model-level inference.
- Please check whether all references to 15 test conditions have been updated to distinguish planned from completed conditions.
All instances throughout the manuscript have been reviewed and updated to clearly distinguish between the 15 planned conditions and the 12 completed conditions, consistent with the clarification introduced in the previous revision.
- Please improve grammatical issues that remain in the revised manuscript, including phrases such as “The reminder of the paper” and “for an example of use.”
Thank you for pointing out these grammatical issues. The manuscript has been carefully proofread and the identified errors have been corrected. In particular, "The reminder of the paper" has been replaced by "The remainder of the paper" and "for an example of use" has been revised accordingly.
- Please ensure that all figures use clear axis labels, units, and captions. The captions should define the meaning of error bars, shaded regions, and the ±6% threshold.
Thank you for your suggestion. All captions for Figures 6–10 have been updated to provide a complete and consistent description of the plotted content. The revised caption format is as follows:
"[…] Indication error and expanded uncertainty range of smartwatch unit X at three excitation amplitudes (1, 2.5, and 4 m/s²). The x-axis represents the excitation frequency (Hz) and the y-axis represents the indication error (m/s²). Error bars represent the expanded uncertainty while the dashed lines indicate the ±6% threshold. […]"
Author Response File:
Author Response.pdf
Round 3
Reviewer 3 Report
Comments and Suggestions for Authors
No further comments.
